ABSTRACT


This thesis presents a globally scalable, decentralized group membership service to
manage  client  process  groups  operating  in  a  distributed,  asynchronous  environment.  This 
group  membership  service  is  totally  scalable,  handling  process  groups  spanning  a  single 
LAN  to  groups  spanning  the  entire  global  Internet  equally  well.  It  provides  for  nested  and 
overlapping  groups,  as  well  as  multiple  groups  residing  on  a  single  LAN.  It  also  provides 
various  Quality  of  Service  selections  which  permit  individual  groups  to  be  configured  for 
an optimal balance between high quality with strong consistency semantics for group
membership, and weaker consistency semantics with reduced complexity and latency.

This  thesis  describes  the  complete  design  of  the  protocol  used  to  implement  the  group 
membership  service.  It  presents  the  design  requirements  and  goals,  and  underlying 
assumptions  about  the  network.  The  various  Quality  of  Service  selections  provided  by  the 
group  membership  service  are  described  in  detail,  as  well  as  the  interface  between  the 
process  groups,  the  membership  service,  and  the  underlying  network.  The  use  of  a 
hierarchical  architecture  to  obtain  the  desired  scalability,  flexibility,  and  robustness  is 
explained.  A  proof  of  correctness  for  the  protocol  is  presented,  and  a  partial 
implementation  of  the  group  membership  service  is  described. 

I. INTRODUCTION

A.  BACKGROUND 

Distributed  networks  of  computers  are  being  used  increasingly  to  provide 
computational  power  and  services  beyond  the  capabilities  of  a  single  computer  system. 
Distributed application programs specifically designed to utilize the distributed networks
of  computers  are  gaining  wide  recognition  as  a  powerful,  flexible,  and  efficient  method  of 
performing  computation.  Often,  these  distributed  applications  can  be  logically  grouped  to 
allow  more  efficient  and  capable  interaction.  The  process  group  paradigm  has  been  shown 
to  be  particularly  well  suited  to  organizing  these  distributed  applications  into  a  single 
entity  working  toward  a  common  goal.  Examples  of  distributed  applications  that  can 
benefit  from  the  use  of  process  groups  include  multimedia  teleconferencing,  distributed 
system  management,  remote  monitoring  and  control  systems,  distributed  reliable 
databases,  banking  and  brokerage  services,  distributed  interactive  simulation  (DIS),  as 
well  as  a  multitude  of  other  applications.  These  process  groups  can  be  arranged  in  many 
possible  configurations  to  suit  the  needs  of  the  particular  application.  Examples  of  various 
process group configurations are shown in Figure 1.


Figure 1: Process Group Configurations

Process group oriented computation based on reliable communication primitives has
been shown to be particularly effective in a wide variety of environments [1, 2, 3, 4]. In
this paradigm, a group may correspond to a set of processes that must behave consistently
to provide a service or make a decision. Changes in the membership of the group may
occur due to the voluntary arrival and departure of members, or failures and recoveries
caused by the dynamic nature of the members. Therefore, a Membership Service (MS) to
manage  the  group  membership  is  a  fundamental  building  block  for  distributed  applications 
using  the  process  group  model. 

To  construct  usable  process  groups,  an  MS  must  first  overcome  the  group 
membership  problem  (GMP);  that  is,  providing  consistent  agreement  on  the  membership 
of  the  group  at  all  members  in  spite  of  dynamic  changes  to  the  group  [5].  This  problem  is 
compounded  by  the  asynchronous  nature  of  the  networks  upon  which  the  process  groups 
operate.  Additionally,  an  MS  must  be  scalable  to  support  groups  of  any  size  and 
distribution.  The  MS  must  be  efficient,  robust,  and  flexible  to  continue  to  provide  services 
to  the  client  process  groups  under  any  circumstances.  The  MS  must  provide  a  uniform 
interface  to  all  applications,  hiding  the  details  of  the  process  group  management  from  the 
users of the MS. An illustration of the logical representation of the MS is shown in Figure 2.

Figure 2: Membership Service and Application Process Groups

B. SCOPE AND ORGANIZATION

This  thesis  presents  the  design  of  a  globally  scalable,  decentralized  group  membership 
service  to  manage  application  process  groups  operating  in  a  distributed,  asynchronous 
environment.  The  scope  of  this  thesis  covers  an  investigation  into  current  group 
membership  protocols  and  membership  services;  the  identification  of  the  design 
requirements  for  an  MS;  the  design  of  a  hierarchical,  scalable  MS  that  meets  all  of  the 
design  goals;  the  detailed  specification  of  the  protocols  which  form  the  MS;  and  a  partial 
implementation  of  an  MS  running  on  a  campus  network. 

The  organization  of  this  thesis  is  as  follows.  The  first  chapter  provides  an 
introduction  to  the  needs  and  requirements  of  distributed  application  process  groups  and 
the  services  provided  by  a  membership  service.  The  second  chapter  describes  the 
necessary  and  useful  attributes  of  a  full-featured  MS,  followed  by  a  survey  of  current 
group  membership  protocols  and  membership  services.  The  third  chapter  provides  a 
detailed  description  of  the  hierarchical  architecture  and  components  of  the  MS.  The 
fourth  chapter  provides  a  detailed  description  of  the  five  protocols  required  to  implement 
the MS, including algorithmic pseudo-code specifications of each. The fifth chapter
provides  a  proof  of  correctness  for  the  MS  protocols,  ensuring  that  the  MS  meets  the 
stated  design  requirements.  The  sixth  chapter  includes  an  implementation  of  a  set  of 
software  utilities  used  by  the  MS  and  a  partial  implementation  of  the  MS  protocols.  The 
final  chapter  provides  conclusions  about  the  design  of  the  MS  and  a  discussion  of  future 
work  to  be  completed. 

II. ATTRIBUTES OF A MEMBERSHIP SERVICE


In  this  chapter  the  desirable  and  necessary  attributes  that  a  general  purpose  MS  must 
possess  are  described.  Design  goals  for  an  MS  which  has  all  of  the  required  attributes  are 
outlined. The network and user interfaces for an MS are defined. Finally, a survey of
current  group  membership  protocols  and  services  is  provided,  showing  the  need  for  a 
full-featured  MS. 

A.  DEFINITIONS  AND  ASSUMPTIONS 

Before  describing  the  attributes,  requirements,  and  features  of  an  MS,  the  operating 
environment  must  first  be  defined.  Certain  assumptions  about  the  functioning  of  the 
underlying  network  and  the  processes  which  comprise  the  MS  and  application  groups 
must  be  made.  These  assumptions  are  outlined  below. 

1.  The  Network 

Few  assumptions  about  the  service  provided  by  the  underlying  networks  and 
internetworks  are  made.  These  networks  are  assumed  to  be  asynchronous  and  unreliable, 
with  only  connectionless,  "best  effort"  datagram  delivery  provided,  with  unbounded 
delivery  time.  Messages  may  be  lost,  delayed,  duplicated,  garbled,  or  arrive  out  of  order. 
Furthermore,  the  networks  may  suffer  partitions,  leading  to  the  interruption  in 
communications  between  end  stations  for  variable  periods  of  time.  It  is  assumed  that  a 
network  multicast  capability  is  provided,  such  as  IP  multicast  [6,  7,  8].  This  multicast 
capability  is  assumed  to  provide  rudimentary  group  management  for  the  set  of  hosts  which 
share  a  common  multicast  address,  including  the  creation  and  maintenance  of  a  multicast 
routing  tree,  and  the  detection  and  removal  of  processes  which  are  not  responding. 
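
As a minimal sketch of the assumed network service, the following Python fragment shows how a process might subscribe to an IP multicast group over UDP with the standard socket module. The group address 239.1.2.3 and the port are illustrative assumptions that do not come from this thesis, and the helper names are hypothetical. Delivery stays best-effort, so nothing here adds reliability or ordering.

```python
import socket
import struct


def membership_request(group_ip, interface_ip="0.0.0.0"):
    """Build the 8-byte ip_mreq structure used with IP_ADD_MEMBERSHIP.

    The first four bytes are the multicast group address, the last four
    the local interface address (0.0.0.0 lets the kernel choose).
    """
    return struct.pack("4s4s",
                       socket.inet_aton(group_ip),
                       socket.inet_aton(interface_ip))


def open_multicast_receiver(group_ip, port):
    """Open a UDP socket and subscribe it to the given multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group_ip))
    return sock


if __name__ == "__main__":
    # Hypothetical group and port, chosen only for illustration.
    receiver = open_multicast_receiver("239.1.2.3", 5000)
    receiver.close()
```

Subscribing in this way only registers the host with the multicast group; detecting that a particular process has stopped responding is still the job of the layers described below.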

2. The Processes

Computer  processes  executing  on  distributed  host  computers  throughout  the 
network  are  the  entities  which  form  the  MS  as  well  as  the  application  process  groups 
which  use  the  MS.  It  is  assumed  that  the  host  computers  and  processes  running  on  them 
are unreliable and may fail at any time. Failure of the host computer and failure of the process
running on it are indistinguishable from the perspective of the MS. It is
assumed that these failures will be fail-stop, or crashes [5, 9, 10, 11]. The computers or
processes will simply cease to function, with no malicious behavior.

The  exchange  of  messages  is  the  only  way  that  distributed  processes  can  learn  of 
each  other's  status.  Due  to  the  unreliable  nature  of  the  network  described  above,  these 
messages  may  never  reach  their  destination,  even  though  both  sender  and  receiver  are 
functioning normally. For this reason, it is impossible for distributed processes to
distinguish between network partitions and the actual failure of other processes [5, 9, 10,
11]. Therefore, the failure of another process can only be perceived, never known for
sure.  Perceived  failures  are  detected  by  the  lack  of  response  within  a  timeout  period. 
Although  these  perceived  failures  may  be  caused  by  a  partition  of  the  network  or  the 
actual  failure  of  the  process,  they  will  be  treated  as  if  the  process  had  actually  failed. 
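
The timeout rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the detector specified in this thesis: the class name, the single fixed timeout, and the injectable clock are all hypothetical choices made to keep the example testable.

```python
import time


class PerceivedFailureDetector:
    """Reports peers whose silence exceeds a timeout as perceived failures."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock        # injectable so tests can control time
        self.last_heard = {}      # peer id -> time of the last message

    def heard_from(self, peer):
        """Record that a message of any kind arrived from `peer`."""
        self.last_heard[peer] = self.clock()

    def perceived_failures(self):
        """Return the peers that have been silent longer than the timeout.

        A crashed peer and a peer cut off by a partition look the same
        here, which is why the result is only a perception.
        """
        now = self.clock()
        return sorted(peer for peer, seen in self.last_heard.items()
                      if now - seen > self.timeout_s)


if __name__ == "__main__":
    fake_now = [0.0]
    detector = PerceivedFailureDetector(5.0, clock=lambda: fake_now[0])
    detector.heard_from("a")
    detector.heard_from("b")
    fake_now[0] = 4.0
    detector.heard_from("b")
    fake_now[0] = 6.0
    print(detector.perceived_failures())
```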

B.  DESIRABLE  ATTRIBUTES  OF  A  MEMBERSHIP  SERVICE 

A  membership  service  must  provide  a  suite  of  services  to  manage  group-oriented 
applications.  Some  of  these  services  are  explicitly  invoked,  such  as  calls  to  create  new 
process  groups,  to  have  processes  join  or  depart  the  process  group,  or  to  split  or  merge 
the  process  group.  Other  services  are  implicitly  and  automatically  provided  by  the  MS, 
such as detecting and processing member failures within the group, detecting and
processing partitions of the network, ensuring unique group names within a given scope,
and providing consistency of ordering of group membership changes at all members. Still
other services provide information to applications upon request, such as group name, size,
membership, view number, and automatic notification of group membership changes. A
membership service also has certain inherent attributes, such as scalability, fault-tolerance,
efficiency and flexibility. Table 1 lists several desirable attributes that a general purpose
MS should possess to fully support application process groups.


TABLE 1: DESIRABLE ATTRIBUTES OF A MEMBERSHIP SERVICE

Attribute                         Interpretation                                           Significance
A: Adaptive status monitor        Adjust timeouts based on local conditions                Minimize wrongly perceived failures
H: Hierarchical protocol          Multilevel membership maintenance                        Exploit hierarchy in WANs, support very large groups
L: scaLability to large groups    Absence of centralized actions in the protocol           Support of large, extensively overlapped groups
M: Multiple network support       Distribution over heterogeneous networks                 Novel applications
N: Non blocking reconfiguration   Processing of continuous status changes                  Enhanced performance for highly dynamic groups
O: topology-based Optimization    Use of physical topology and LAN features                Support of widely distributed groups
P: network Partitioning           Merging after recovery with required consistency         Increased applicability of Membership Service
R: Real-time service              Guaranteed detection and processing latency for changes  Support real-time applications
S: multiple Simultaneous changes  Quick update with weaker consistency                     Multiple classes of service with overhead proportional to quality
X: flexible membership semantics  Availability of a range of consistency semantics

It should be noted that some of the desirable attributes listed in Table 1 conflict
with each other. For example, adjustment of timeouts based on local conditions will
violate the real-time aspect of the MS. Non-blocking reconfiguration and merging after
partitions conflict with providing strongly ordered membership change semantics. Thus, a
fully-featured  MS  must  permit  the  membership  service  user  (MSU)  to  choose  which  of 
these  conflicting  desirable  attributes  will  have  priority.  The  MSU  is  given  the  option  of 
choosing  various  Quality  of  Service  (QoS)  selections  to  configure  the  MS  to  the  exact 
needs  of  the  application. 
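
The tension between adaptive timeouts (attribute A) and real-time service (attribute R) can be made concrete with a small sketch. The estimator below borrows the smoothing scheme used for TCP retransmission timers; it is not the status monitor defined in this thesis, and the class name, parameters, and optional ceiling are assumptions introduced for illustration. Without a ceiling the timeout follows local conditions; with one, detection latency is bounded but slow peers are more likely to be wrongly perceived as failed.

```python
class AdaptiveTimeout:
    """Failure-detection timeout that tracks observed response times."""

    def __init__(self, initial_s, alpha=0.125, beta=0.25, k=4.0, ceiling_s=None):
        self.srtt = initial_s          # smoothed response time
        self.rttvar = initial_s / 2    # smoothed deviation
        self.alpha = alpha
        self.beta = beta
        self.k = k
        self.ceiling_s = ceiling_s     # optional hard bound on the timeout

    def observe(self, sample_s):
        """Fold one measured response time into the running estimates."""
        self.rttvar = ((1 - self.beta) * self.rttvar
                       + self.beta * abs(self.srtt - sample_s))
        self.srtt = (1 - self.alpha) * self.srtt + self.alpha * sample_s

    def timeout(self):
        """Current timeout: smoothed mean plus k deviations, capped if asked."""
        value = self.srtt + self.k * self.rttvar
        if self.ceiling_s is not None:
            return min(value, self.ceiling_s)
        return value


if __name__ == "__main__":
    adaptive = AdaptiveTimeout(1.0)
    bounded = AdaptiveTimeout(1.0, ceiling_s=2.0)
    for sample in (1.0, 3.0):
        adaptive.observe(sample)
        bounded.observe(sample)
    print(adaptive.timeout(), bounded.timeout())
```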

C.  MEMBERSHIP  SERVICE  DESIGN  GOALS 

In  this  section  the  design  goals  of  a  full-featured  MS  are  described. 

1.  Scalability 

The  MS  must  be  completely  scalable.  Application  process  groups  spanning  a 
single  local  area  network  (LAN)  or  the  worldwide  Internet  will  see  the  same  level  of 
service.  To  accomplish  this  goal,  the  membership  information  for  all  groups  must  be 
maintained  hierarchically.  Information  about  process  groups  will  be  distributed 
throughout  the  hierarchy,  so  that  each  node  need  only  store  and  process  information  for 
the  application  groups  that  it  supports  directly  below  it.  In  this  manner,  the  MS  nodes  that 
have  no  member  processes  for  a  particular  application  are  in  no  way  impacted  by  the 
processing  of  membership  changes  for  this  application.  Additionally,  the  MS  will  use  a 
decentralized,  hierarchical  decision  making  scheme,  since  a  centralized  scheme  is  not 
scalable.  The  decisions  about  membership  changes  to  application  groups  will  be  made  by 
a  set  of  distributed  nodes  located  in  the  hierarchy,  which  will  then  propagate  the  decision 
to  all  process  group  members.  By  using  the  hierarchical  nature  of  the  MS,  the  number  of 
nodes  involved  in  each  membership  change  decision  will  be  small.  Additionally,  the  level 
of  the  set  of  nodes  in  the  hierarchy  will  be  different  for  most  process  groups,  since  the 
span  of  most  groups  will  be  different.  Thus,  different  parts  of  the  hierarchy  can  function 
concurrently, processing membership change decisions for different groups at different
levels  without  affecting  the  operation  of  the  other  parts  of  the  hierarchy. 
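
The idea that a node stores information only for groups with members below it can be sketched as follows. The class and method names are hypothetical, and the sketch ignores scope limits and failure handling; it only shows that nodes off the path from a member to the root hold no state for the group.

```python
class MSNode:
    """One node of a membership hierarchy (illustrative only)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        # group name -> names of children (or local members) that have
        # at least one member of that group beneath them
        self.groups = {}

    def register(self, group, below):
        """Record that `group` has a member reachable through `below`.

        The parent is told only the first time this node learns of the
        group, so repeated joins beneath one node cause no upward traffic.
        """
        first_time = group not in self.groups
        self.groups.setdefault(group, set()).add(below)
        if first_time and self.parent is not None:
            self.parent.register(group, self.name)


if __name__ == "__main__":
    root = MSNode("root")
    region_a = MSNode("region_a", root)
    region_b = MSNode("region_b", root)
    lan1 = MSNode("lan1", region_a)
    lan1.register("g", "proc1")
    print(sorted(root.groups), sorted(region_b.groups))
```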

2.  Efficiency 

The  MS  must  be  efficient  in  the  use  of  computational  and  network  resources  in 
order  to  be  scalable.  Using  the  decentralized  hierarchical  structure,  each  node  in  the 
hierarchy need only process and store a small part of the information needed to support all
application  process  groups. 

Since host computers attach to the internetwork through a LAN, access to the
MS  must  be  present  at  the  LAN  level  at  all  times,  even  if  there  are  no  groups  present  on  a 
particular  LAN.  This  will  drastically  reduce  the  latency  for  creating  new  groups  and 
permits  the  use  of  special  LAN-level  features  such  as  hardware  multicast.  To  provide  this 
continual  access,  a  daemon  process  should  be  running  on  each  MS  capable  host  computer, 
and  an  MS  node  should  be  running  on  a  dedicated  server  for  the  LAN.  The  daemon 
process  provides  an  interface  between  the  MSU  and  the  MS. 

Multicast  messages  must  be  used  to  process  all  changes,  since  multicasting  is  an 
extremely  efficient  method  for  multiple  processes  to  communicate.  Additionally,  the  use 
of a hierarchy provides a natural funneling effect for multiple messages propagating to
higher  levels  in  the  hierarchy.  This  is  a  form  of  concast  [12],  reducing  the  load  on  the 
network  at  each  level  in  the  hierarchy. 

3.  Resilience  to  Failures  and  Partitions 

The MS must provide membership semantics that handle failures of members as
well  as  the  underlying  network.  Failures  of  either  members  or  the  network  must  be 
automatically  detected  and  processed,  reforming  the  group  without  any  direct  intervention 
by  the  application  processes  or  the  MSU.  The  MS  must  use  a  decentralized  protocol  to 
eliminate any single point of failure. Multiple simultaneous failures of member processes
must be detected and processed without blocking, usually by "batching" the failures into a
single change to the membership.
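
A minimal sketch of batching, assuming a view is just a view number plus a set of members: all failures perceived so far are removed in one step, so the view number advances once regardless of how many members failed. The function name and data shapes are illustrative, not taken from the protocol specification.

```python
def batch_failures(view_number, members, perceived_failed):
    """Apply every currently perceived failure as one membership change.

    Returns the new view number and member set. Reports about processes
    that are not members are ignored, and the view number is unchanged
    when no member is affected.
    """
    removed = set(members) & set(perceived_failed)
    if not removed:
        return view_number, frozenset(members)
    return view_number + 1, frozenset(members) - removed


if __name__ == "__main__":
    print(batch_failures(3, {"a", "b", "c", "d"}, ["b", "d"]))
```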

4.  Levels  of  Consistency 

As  identified  in  other  group  membership  protocols  [11,  13,  14],  there  are  various 
possible  levels  of  consistency  in  the  ordering  of  changes  to  the  membership  view  at 
members  of  a  process  group.  Strong  consistency  guarantees  that  all  members  see  exactly 
the  same  changes  to  the  group  membership  in  exactly  the  same  order.  Weak  consistency 
guarantees  that  all  group  members  will  eventually  reach  the  same  view  of  the  group 
membership,  but  may  hold  disparate  views  for  some  period  of  time.  Strong  consistency 
requires added complexity and overhead to ensure that all members have the same
ordering of membership changes, while weak consistency relaxes these requirements
and is therefore less complex. Strong consistency must
block all changes to the membership until the current change finishes, while weak
consistency may process concurrent changes. Thus, weak consistency generally has
lower latency than strong consistency. The MS must provide flexible membership
semantics  for  the  application  groups  supported,  allowing  the  MSU  to  select  the  level  of 
consistency  needed  for  the  particular  application. 
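
One way to picture strong consistency at a receiving member is sketched below, under the assumption that every change carries the view number it produces. A member holds back any change that arrives early and installs changes strictly in view-number order, so every member passes through the same sequence of views even when the network reorders or duplicates messages. The class is hypothetical and covers only the receiving side, not the agreement protocol that assigns view numbers.

```python
class OrderedViewInstaller:
    """Installs membership changes strictly in view-number order."""

    def __init__(self, initial_members):
        self.view = 0
        self.members = set(initial_members)
        self.pending = {}                      # view number -> (op, member)
        self.history = [frozenset(self.members)]

    def receive(self, view_number, op, member):
        """Accept one change message; op is "join" or "leave"."""
        if view_number <= self.view or view_number in self.pending:
            return                             # duplicate or already installed
        self.pending[view_number] = (op, member)
        # Install every change that is now contiguous with the current view.
        while self.view + 1 in self.pending:
            next_op, who = self.pending.pop(self.view + 1)
            if next_op == "join":
                self.members.add(who)
            else:
                self.members.discard(who)
            self.view += 1
            self.history.append(frozenset(self.members))


if __name__ == "__main__":
    p = OrderedViewInstaller({"a", "b"})
    p.receive(2, "leave", "a")   # held back until view 1 arrives
    p.receive(1, "join", "c")
    print(p.view, sorted(p.members))
```

A weakly consistent member could instead apply each change on arrival, which avoids the hold-back delay but lets members disagree temporarily.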

5.  Membership  and  Name  Scope  Control 

The  MS  must  provide  a  means  to  limit  the  extent  of  individual  application 
groups.  Without  such  a  limit,  all  application  groups  could  potentially  use  the  whole  MS 
hierarchy, even if only a small part of the hierarchy is actually needed, creating a
bottleneck at the highest level in the hierarchy. The use of "scope control" parameters
limits the maximum span of an application group in the MS hierarchy to the referenced
level. Membership scope control limits the extent of group name searches whenever an
application group is referenced, such as a request to create a new group or join an existing
group. Name scope control limits the maximum span in the MS hierarchy which an
application group can cover. This parameter can be used when the application group is
created, and specifies the highest level in the MS hierarchy at which the group name
should be registered. References to an application group outside of the name scope will
not find the application group, and will propagate to the highest level in the MS hierarchy
unless limited by the membership scope control parameter.
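
The interaction of the two scope parameters can be sketched as follows. The level numbering (0 at the LAN level, increasing toward the root) and all names are assumptions made for this illustration; the thesis does not fix them in this section.

```python
class Node:
    """Hierarchy node with a level: 0 at the LAN, larger toward the root."""

    def __init__(self, name, level, parent=None):
        self.name = name
        self.level = level
        self.parent = parent
        self.registered = set()    # group names registered at this node


def register_name(leaf, group, name_scope):
    """Register `group` from the leaf upward, never above level name_scope."""
    node = leaf
    while node is not None and node.level <= name_scope:
        node.registered.add(group)
        node = node.parent


def find_group(leaf, group, membership_scope):
    """Search upward for `group`, never above level membership_scope.

    Returns the lowest node that knows the group, or None if the search
    is exhausted within the permitted scope.
    """
    node = leaf
    while node is not None and node.level <= membership_scope:
        if group in node.registered:
            return node
        node = node.parent
    return None


if __name__ == "__main__":
    root = Node("root", 2)
    campus = Node("campus", 1, root)
    lan_a = Node("lan_a", 0, campus)
    lan_b = Node("lan_b", 0, campus)
    register_name(lan_a, "g", name_scope=1)
    found = find_group(lan_b, "g", membership_scope=1)
    print(found.name if found else None)
```

A search that starts outside the registered span climbs until it reaches the membership scope limit or the root, which is the cost the membership scope parameter is meant to bound.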

6.  Selectable  Quality  of  Service 

The  MS  must  permit  user  selection  of  the  conflicting  desirable  attributes 
identified  in  Table  1.  Some  of  the  QoS  selections  which  must  be  supported  include:  the 
level  of  consistency  in  ordering  of  membership  changes,  methods  of  resolving  partitions  in 
application  groups,  adaptive  status  monitor  conditions  to  adjust  the  MS  for  local 
conditions,  designation  of  a  limited  scope  for  the  application  group,  and  user  configuration 
of  the  MS  hierarchy  for  special  purpose  applications.  An  MSU  must  be  able  to  select  the 
desired  level  of  service  by  specifying  certain  parameters  related  to  the  QoS.  These 
parameters  specify  how  application  group  partitions  are  handled,  how  the  scope  of  a 
group  name  is  controlled,  how  the  membership  change  information  is  ordered,  the  setting 
of  the  failure  detection  timeouts,  and  the  aggregation  of  multiple  simultaneous  changes. 
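
As a sketch of how an MSU might express these selections, the record below groups the parameters named above and reports combinations that the discussion of Table 1 identifies as conflicting. Every field name and default here is hypothetical; the thesis does not define this data structure.

```python
from dataclasses import dataclass
from enum import Enum


class Ordering(Enum):
    STRONG = "strong"
    WEAK = "weak"


@dataclass(frozen=True)
class QoSSelection:
    """Illustrative bundle of QoS parameters chosen by an MSU."""
    ordering: Ordering = Ordering.STRONG
    merge_after_partition: bool = False
    non_blocking: bool = False
    adaptive_timeouts: bool = False
    real_time: bool = False
    batch_simultaneous_changes: bool = True
    name_scope: int = 0

    def conflicts(self):
        """List the selected attributes that work against each other."""
        found = []
        if self.adaptive_timeouts and self.real_time:
            found.append("adaptive timeouts vs. real-time service")
        if self.ordering is Ordering.STRONG:
            if self.non_blocking:
                found.append("non-blocking reconfiguration vs. strong ordering")
            if self.merge_after_partition:
                found.append("merging after partitions vs. strong ordering")
        return found


if __name__ == "__main__":
    choice = QoSSelection(adaptive_timeouts=True, real_time=True)
    print(choice.conflicts())
```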

D.  MEMBERSHIP  SERVICE  INTERFACES 

In this section the relation of the MS protocols is defined with respect to the Internet
Protocol (IP) protocol stack, which is the de facto standard for internetworking
communications. Additionally, the application user's interface and the MS system
configuration interface are described.

1.  Network  Protocol  Layers 

Figure 3 illustrates the relation of the MS protocols to the Transmission Control
Protocol/Internet Protocol (TCP/IP) suite of internetworking protocols in the layer
below, and the application programs and upper-layer protocol modules in the layer above,
using the common layering model for depicting the hierarchical dependencies of network
protocols.


[Figure 3 layers, top to bottom: Application and Upper-Layer Protocol Modules; Membership Service Interface; Membership Service Module (Member Interface (MI), mserver); Transport Service Interface; UDP and IP Multicast; IP Service Interface; IP Module (ICMP, IGMP)]

Figure 3: Protocol Layers


2.  User  Interface 

The  application  user's  interface  to  the  MS  is  provided  through  explicit  system 
calls  to  alter  the  membership  or  provide  information  about  application  process  groups. 
The MS is implicitly called to change the membership of application groups any time a
process failure or network partition occurs. The following lists explain these system calls
and  events  in  more  detail. 


a.  Informational 

1. Group View ("group")

Provide the current group view number maintained by the MS. Used by
application processes to guarantee all members have the most recent view of the group
membership.

2. Group Statistics ("group")

Provide current group view number and membership list maintained by
the MS.

b.  Explicit  Membership  System  Calls 

1. Join ("group", membership_scope, name_scope)

Request by a new member process to join a group which may or may
not already exist. If the group does not presently exist, a new group is formed with only
this member. If the group does exist within the requested scope, the MS processes the
change and informs the application group of the addition. The membership_scope field is
used to specify the highest level in the MS hierarchy which should be searched for the
application group name during a join, thus limiting the time required to determine if the
group exists, and the impact on other groups. The name_scope field is used during the
creation of the process group to specify the maximum span the application group will ever
cover in the MS hierarchy. This field limits the extent of the search required whenever a
group is referenced.

2. Leave ("group", gid, membership_scope)

Request by member with group identity number "gid" to leave a group.
The departing member is able to leave immediately, without waiting for a response from
the MS.

3. Merge ("group1", "group2")

Request by a member of group1 to merge group1 and group2. Upon
successful completion, the union of the two groups will be formed, using group name
"group1", with a new group view. This request is the general form of the join request.

4. Split ("group1", "group2", g2MemberList)

Request by a member of group1 to remove one or more members of group1,
listed in the parameter g2MemberList, and form a new group2 with these members. This
request is the general form of the leave request.

c. Implicit Membership Altering Events

1. Failures and Partitions

The  MS  will  automatically  handle  perceived  failures  of  group  members, 
up  to  and  including  all  members.  Automatic  notification  of  member  failures  is  provided  to 
the  application  group. 
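
The calls listed above can be exercised against a single-process stand-in such as the sketch below. It is only a model of the interface: scope parameters are accepted and ignored, group identity numbers are handed out from one global counter, and nothing is distributed. The class and method names are hypothetical and are not the thesis's implementation.

```python
class LocalMembershipStub:
    """Single-process stand-in for the MS user interface (illustrative)."""

    def __init__(self):
        self._members = {}     # group name -> set of gids
        self._views = {}       # group name -> current view number
        self._next_gid = 1

    def _bump(self, group):
        """Advance the view number of `group` after a membership change."""
        self._views[group] = self._views.get(group, 0) + 1

    def join(self, group, membership_scope=0, name_scope=0):
        """Add a new member, creating the group if needed; return its gid."""
        gid = self._next_gid
        self._next_gid += 1
        self._members.setdefault(group, set()).add(gid)
        self._bump(group)
        return gid

    def leave(self, group, gid, membership_scope=0):
        self._members[group].discard(gid)
        self._bump(group)

    def merge(self, group1, group2):
        """Form the union of both groups under the name group1."""
        self._members[group1] |= self._members.pop(group2)
        self._views.pop(group2, None)
        self._bump(group1)

    def split(self, group1, group2, g2_member_list):
        """Move the listed members of group1 into a new group2."""
        moving = self._members[group1] & set(g2_member_list)
        self._members[group1] -= moving
        self._members[group2] = moving
        self._bump(group1)
        self._bump(group2)

    def group_view(self, group):
        return self._views[group]

    def group_statistics(self, group):
        return self._views[group], sorted(self._members[group])


if __name__ == "__main__":
    ms = LocalMembershipStub()
    first = ms.join("g1")
    second = ms.join("g1")
    ms.split("g1", "g2", [second])
    print(ms.group_statistics("g1"), ms.group_statistics("g2"))
```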

3.  System  Configuration  Interface 

The  system  calls  used  to  configure  the  MS  hierarchy  are  virtually  the  same  as 
those  used  by  application  groups,  with  the  exception  of  calls  to  make  certain  nodes  parent 
nodes  of  others,  thus  creating  the  hierarchy.  The  configuration  of  the  MS  is  performed  by 
a  system  administrator,  using  individual  command  line  system  calls  or  an  MS  configuration 
program  called  MS  mgr. 

E.  CURRENT  PROTOCOLS 

A  summary  of  existing  membership  protocols  is  provided  in  Table  2.  The  category 
headings are the same desirable attributes of a membership service listed in Table 1. Finally, a
listing  of  the  design  goals  and  desirable  attributes  contained  in  the  MS  presented  in  this 
thesis  is  shown  in  Table  3  for  comparison. 

Unlike any known group membership protocol, the group membership service
described in this thesis is totally scalable, handling process groups spanning a single LAN
to groups spanning the entire global Internet equally well. It provides for nested and
overlapping groups, as well as multiple groups residing on a single LAN. It also provides
various Quality of Service selections which permit individual groups to be configured for
an optimal balance between high quality with strong consistency semantics for group
membership, with the associated complexity and latency, and weaker consistency
semantics with reduced complexity and latency.


TABLE 2: A SUMMARY OF EXISTING MEMBERSHIP PROTOCOLS
Index to Columns: see Table 1.
Index to Entries: ✓: Supported, X: Not supported, E: Support possible with extensions, -: Unknown
(?: illegible in the source; rows with fewer than ten entries are incomplete, with surviving entries listed in order)

Protocol                Required Network Properties  Principal Feature                A H L M N O P R S X

Asynchronous Environment:
Chang et al. [15]       unreliable message           token site                       X X X X X X X X X
[name illegible]        message diffusion            version numbers, stable storage  E X X X X X X X E X
El Abbadi et al. [17]   unreliable message           virtual partitions               E ? ✓ ✓ X E ✓ X X
Verissimo et al. [18]   broadcast LAN                two-phase accept                 X X X X X X X ✓ E X
Moser et al. [19]       ordered, reliable            ordinal numbers                  X X X X X E X X X X
Riccardi et al. [?]     unreliable message           reconfiguration manager          E E E E E X X X X
Mishra et al. [20]      ordered, reliable            Psync & conversations            X X E E X E ? X E
Auerbach et al. [21]    multicast hardware           multicast sequences              X E X X ? ✓ X ✓ X
Jahanian et al. [13]    unreliable message           crown prince                     E E E E ? X E X ?
Golding et al. [22]     unreliable message           time-stamped anti-entropy        E X X - X

Synchronous Environment:
[name illegible]        bounded delay                attendance lists                 X X X X ? X X ? E X
[name illegible]        bounded delay                time-domain multiplexing         X X X X X X X
[name illegible]        TDMA bus                     reception history                X X X X X X X
Rodrigues et al. [25]   exposed LAN interface        transmit-with-response           X X X X X X X


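TABLE 3: ATTRIBUTES OF THE MEMBERSHIP SERVICE
Index to Columns: see Table 1.
(?: illegible in the source; rows with fewer than ten entries are incomplete, with surviving entries listed in order)

Required Network Properties     Principal Feature                                   A H L M N O P R S X
Unreliable messages             Decentralized protocol based on ordered membership  ✓ ✓ ? X
Bounded delay message delivery                                                      X X ?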

III. MEMBERSHIP SERVICE ARCHITECTURE


At  the  foundation  of  the  scalable  and  efficient  Membership  Service  lies  the 
architectural  structure.  The  key  to  a  scalable  Membership  Service  is  a  decentralized, 
hierarchical  architecture.  The  Membership  Service  uses  a  hierarchical  architecture 
designed  to  follow  the  pre-existing  physical  topology  of  the  subnetworks,  networks,  and 
internetworks  upon  which  the  distributed  application  process  groups  that  the  Membership 
Service  supports  will  be  running.  This  chapter  describes  the  structure  and  composition  of 
the  physical  hierarchy  of  the  MS  and  how  this  architecture  supports  application  process 
groups. 

A.  PHYSICAL  HIERARCHY 

The  relevance  of  the  architecture  of  the  MS  to  the  scalability  of  the  MS  is  obvious 
when the global scale is considered. There are presently over 120 million computers and 1
million  LANs  world-wide,  connected  by  bridges  and  routers  to  form  global  internetworks. 
A  centralized  MS  would  require  the  central  node  to  interact  directly  with  all  of  these 
computers  distributed  throughout  the  world,  clearly  an  impossibility.  By  forming  a  logical 
hierarchy,  the  interaction  required  by  each  node  in  the  hierarchical  tree  is  limited  to  those 
nodes  directly  above  and  below,  providing  a  uniform  load  for  any  node  in  the  hierarchy. 
The  significance  of  the  hierarchical  structure  is  illustrated  in  Figure  4,  where  an  n-ary 
hierarchical  tree  is  formed  with  eight  levels  of  ten  nodes  each,  providing  support  for 
virtually  all  of  the  world's  computers  at  the  leaf  level.  With  this  hierarchy  it  is  possible  for 
any  leaf  computer  to  communicate  with  the  root  level  of  the  tree  with  only  six 
intermediate  relays  by  nodes  in  the  tree.  If  these  intermediate  nodes  are  logically 
connected  in  a  manner  which  closely  follows  their  physical  connectivity,  the  connection 
from  leaf  to  root  level  could  require  as  few  as  six  physical  communication  links. 

The other significant aspect of the hierarchical structure is that a node at each level of
the tree need only interface with the parent node above and the children nodes below. In
Figure  4,  each  node  communicates  directly  with  only  ten  children  nodes  and  one  parent 
node.  This  is  in  comparison  to  the  interaction  in  a  centralized  MS,  where  a  single  manager 
node  must  communicate  directly  with  all  other  nodes  in  the  MS  -  potentially  millions  of 
nodes  managed  by  a  single  manager. 
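
The arithmetic behind Figure 4 can be checked with a short sketch. It assumes that the root level is itself a set of ten mservers, so that level k of the tree holds 10^k nodes; this reading, and the function names, are illustrative and are not taken from the MS implementation.

```python
def leaf_capacity(fanout: int, levels: int) -> int:
    """Nodes at the leaf level, assuming level k holds fanout**k nodes."""
    return fanout ** levels


def intermediate_relays(levels: int) -> int:
    """Levels strictly between the leaf level and the root level."""
    return max(levels - 2, 0)


def direct_peers(fanout: int) -> int:
    """Children plus the single parent that an interior node talks to."""
    return fanout + 1
```

Under these assumptions, eight levels with a fan-out of ten give 100,000,000 leaf positions, six intermediate relay levels, and eleven direct peers per interior node, compared with millions of direct peers for a single centralized manager.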

1.  Mservers  and  Member  Interfaces 

The MS comprises two primary entities: Membership Servers (mservers)
and Member Interfaces (MI). The mservers are the heart of the MS, forming the nodes of
the  hierarchy.  The  mservers  are  processes  running  on  routers  or  host  computers 
distributed  throughout  the  internetwork.  The  mservers  provide  connectivity,  routing,  and 
record-keeping  functions  in  a  distributed,  decentralized  manner  for  the  MS.  The  mservers 
are  primarily  responsible  for  processing  changes  and  providing  information  to  the 
members of both the physical hierarchy and the application process groups using the
MS.  Typically,  one  mserver  process  runs  on  each  router  or  name  server  in  the  network, 
and  one  mserver  runs  on  a  dedicated  host  or  the  designated  router  for  each  connected 
LAN. Application group processes interface with the MS through an MI process running
on each host computer. Each MI accepts requests for changes to or information about
application groups from the individual application member processes running on the
particular host computer. The MI then reliably relays these requests to the LAN mserver
for  submission  to  the  MS.  The  MI  receives  responses  from  the  LAN  mserver  and  reliably 
propagates  these  responses  to  the  application  member  processes  that  it  supports.  Each  MI 
is  able  to  support  numerous  application  groups  and  numerous  individual  member 
processes  from  each  application  group,  limited  only  by  the  available  resources  of  the 
individual  host  computer. 

2.  Organization  and  Configuration 
a. Logical Hierarchy

The  physical  hierarchy  of  the  MS  is  formed  with  mserver  nodes  logically 
connected together to form an n-ary tree. The MIs are located at the leaf level of the
physical tree, at the host computer level, providing an immediate interface for the
application group processes running on the host computer. Figure 5 illustrates an example
logical hierarchy of mservers, MIs, and application group processes. The architecture
shown is a representative configuration for a small area encompassing a single institution,
such as a campus or business. In this case, the architecture shown is the configuration of
the Naval Postgraduate School (NPS), where the MS is under development. In Figure 5,
the  set  of  mservers  labeled  NPS  are  servers  at  the  root  level,  attached  to  the  campus 
backbone,  representing  the  whole  campus.  At  the  next  lower  level  are  sets  of  mservers 
representing  individual  buildings  at  the  campus,  labeled  Spanagel,  Root,  and  Ingersoll. 
Each of the mservers in these sets is a server on a LAN located in the building. At the
next lower level are the MIs running on individual host computers on each LAN. The
LANs are labeled as ECE1, ECE2, SP1, and so on. Below the MIs are the application
group processes running on each host computer. In this example, there are four
application groups shown. Some MIs are shown supporting more than one group, each
with one or more members per host, while other MIs have no application groups to
support. The MI process remains resident on the host computer even if no applications
are running, to provide quick access to the MS.

Figure 5: Logical MS Hierarchy

b. Physical Topology

The  logical  hierarchy  shown  in  Figure  5  corresponds  to  the  physical 
topology  of  networks  and  computers  shown  in  Figure  6.  In  this  illustration,  each 
successively  larger  grouping  of  computers  and  networks  is  indicated  by  dotted  lines  and 
the  associated  name,  corresponding  to  the  sets  of  MI  or  mservers  shown  at  each  level  in 
Figure 5.

Figure 6: Physical MS Hierarchy

c. Semi-static Configuration

The mservers and MIs of the MS are manually configured into the desired
physical  hierarchy  by  a  local  system  administrator  or  cognizant  authority.  This 
configuration  is  expected  to  be  semi-static,  normally  changing  only  when  additions  and 
deletions to the networks maintained by the administrator are made. The system
administrator will assign an appropriate name to each set of mservers at each level,
corresponding to the multicast group which connects the set of mservers. The assignment
of a set name and multicast address is accomplished when the set of mservers is created
and joined together, using software calls to the MS.

d.  Failures,  Partitions,  and  Dynamic  Reformation 

Although  the  mserver  and  MI  configuration  is  not  expected  to  change  very 
often,  there  is  still  a  possibility  of  the  failure  of  the  mserver  or  MI  processes,  the  host 
computers  or  servers  upon  which  they  are  running,  or  partitions  in  the  network.  These 
failures  and  partitions  lead  to  a  dynamic  reconfiguration  of  the  physical  structure  of  the 
MS, with the surviving mservers and MIs automatically reforming into partitioned sets.
Since  any  failure  or  perceived  failure  of  an  mserver  is  actually  a  virtual  partition  of  the 
network,  all  failures  and  partitions  will  lead  to  the  creation  of  one  or  more  partitioned 
subsets  of  the  original  set  of  mservers.  Each  partitioned  subset  of  mservers  will 
correspond  to  that  subtree  of  the  physical  hierarchy  on  one  "side"  of  the  partition;  that  is, 
all  of  the  mservers  which  are  still  able  to  communicate  over  the  non-partitioned  network. 
Each  reformed  physical  hierarchy  of  the  MS  will  continue  to  function,  providing  service  to 
all  application  process  groups  with  members  still  existing  in  the  partition.  The  application 
process groups which span the partitioned network will also experience a partition in their
membership.  This  condition  will  continue  until  the  physical  network  partition  is  repaired, 
at  which  time  the  physical  hierarchy  of  mservers  will  either  manually  or  automatically  be 
reformed  to  the  original  configuration.  Once  the  physical  hierarchy  is  restored,  the 
surviving  application  groups  will  also  be  reformed,  if  this  is  the  QoS  related  to  partition 
resolution  chosen  by  the  application  user  at  start  up  time. 

In  addition  to  the  overall  hierarchical  structure  of  the  MS,  each  set  of 
mservers  in  the  physical  hierarchy  is  also  organized  into  a  monitoring-set  and 
change-processing core-set. The LAN mservers are also responsible for monitoring the
status of all MIs on the LAN. These organizations of mservers will be explained in the
next sections.

3.  Monitoring  Set 

a. Definition and Purpose

The  first  criterion  for  an  MS  to  be  dynamically  reconfigurable  is  to  be  able  to 
detect failures of the component entities. To accomplish this, each set of mservers in the
physical hierarchy is organized into a monitoring-set. The purpose of this monitoring-set
is to detect and announce the failure of any failed or perceived failed mserver in the set.
The  detection  method  used  is  pairwise,  peer-to-peer  monitoring  of  the  mservers  in  the 
monitoring-set.  Each  mserver  is  responsible  for  monitoring  one  other  mserver  in  the  set, 
and  in  turn  is  monitored  by  one  other  mserver.  The  monitoring  is  accomplished  by  the 
monitor  sending  periodic  Query  messages  to  the  monitored  mserver,  which  then  responds 
with  a  Reply  message,  indicating  normal  status. 

b.  Structure 

An  illustration  of  a  monitoring-set  is  shown  in  Figure  7.  The  pairs  of 
monitoring and monitored mservers are determined by the order in which the mservers
join the monitoring-set. Each newly joining mserver is connected into the pairwise
monitoring sequence as the mserver monitored by the highest rank (oldest) mserver in the
set, and will begin monitoring the previously lowest rank (youngest) mserver.
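
The join rule above determines the whole monitoring ring from the join order alone. A minimal sketch follows, assuming the set is represented as a list of mserver names ordered oldest first; the function name and representation are hypothetical.

```python
from __future__ import annotations


def monitoring_pairs(members_by_age: list[str]) -> dict[str, str]:
    """Map each monitor to the mserver it monitors.

    members_by_age is ordered oldest first. Every mserver monitors the one
    that joined immediately before it; the oldest monitors the youngest,
    which closes the ring.
    """
    if len(members_by_age) < 2:
        return {}  # a lone mserver has no peer to monitor
    return {
        monitor: members_by_age[i - 1]  # index -1 wraps to the youngest
        for i, monitor in enumerate(members_by_age)
    }
```

For the join order A, B, C this yields A monitoring C, B monitoring A, and C monitoring B. When D joins, only the oldest mserver changes its target (A now monitors D), and D begins monitoring C, the previous youngest.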

c.  Failure  Detection 

As  with  nearly  every  message  sent  within  the  MS,  the  monitor  will  set  a 
timer  upon  sending  the  Query  message.  If  a  Reply  message  is  not  received  before  the 
timer  expires,  the  monitor  will  suspect  the  monitored  mserver  of  failure.  One  or  more 
retries  will  be  conducted,  and  if  the  monitored  mserver  does  not  respond  in  this  time,  it 
will  be  declared  failed  by  the  monitor,  which  will  then  announce  the  failure  to  all  other 
mservers in the set. The mserver detected as failed may have actually failed, or may be
unable  to  communicate  with  the  monitor;  in  either  case,  it  will  be  considered  failed  by  all 
mservers  which  receive  the  monitor's  announcement. 
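
The timeout-and-retry rule can be sketched as a single monitoring round. The network is abstracted behind a caller-supplied function, and the default timeout and retry count are placeholders rather than values taken from the MS design.

```python
from typing import Callable


def monitor_once(send_query: Callable[[float], bool],
                 timeout: float = 1.0,
                 max_retries: int = 2) -> str:
    """Return "alive" or "failed" for one monitoring round.

    send_query(timeout) is assumed to send a Query message and return True
    if a Reply arrived before the timer expired.
    """
    for _attempt in range(1 + max_retries):  # first try plus the retries
        if send_query(timeout):
            return "alive"
    # No Reply after all retries: the monitor declares the mserver failed
    # and would now announce the failure to the rest of the set.
    return "failed"
```

The monitor cannot distinguish a crashed mserver from an unreachable one, so both cases return the same result, matching the treatment of perceived failures described above.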

4. Change-processing Core-set

a. Definition

A  second  organization  applied  to  the  set  of  mservers  at  each  level  in  the 
hierarchy  is  that  of  a  change-processing  core-set.  This  set  of  mservers  is  responsible  for 
processing  all  membership  change  requests  submitted  by  the  application  process  groups 
that  it  supports,  as  well  as  enacting  changes  in  the  physical  hierarchy.  The  change 
processing involves reaching a consistent agreement amongst all core-set mservers about
the change being submitted, then reliably propagating this change back to the application
process  members,  who  are  then  guaranteed  to  have  a  consistent  view  of  the  changed 
application  group  membership. 

b. Purpose

This  organization  is  termed  a  core-set  because  it  is  the  small  set  of  mservers 
at  that  "top"  level  for  the  group  in  the  physical  hierarchy  which  connects  the  particular 
application process group supported. For example, the set of mservers labeled NPS in
Figure 5 serves as the core-set for all four application groups, since each application group
has members distributed on all LANs. The sets of mservers at lower levels in the
hierarchy will not process these application membership changes, but will submit them to
the core-set, then relay the results back to the MIs. In this manner, the hierarchical
structure of the MS is used to reduce the number of mservers that cooperate to process a
membership change for an application group to those in the core-set for that group. This
organization  leads  to  very  efficient  and  fast  processing  of  membership  changes  for  groups 
of  any  size  and  distribution,  since  only  the  core-set  of  mservers  will  be  processing  the 
change.  It  also  provides  the  necessary  scalability  for  the  MS,  since  application  process 
groups  of  any  size  or  distribution  will  have  a  small  core-set  of  mservers  processing  the 
membership  changes,  and  thus  will  experience  nearly  the  same  small  processing  time.  The 
primary  difference  in  membership  change  processing  times  for  different  application  groups 
will be caused by the level of the core-set in the physical hierarchy. A core-set at a higher
level  will  have  more  intermediate  relaying  mservers  between  it  and  the  application  member 
processes,  thus  creating  a  longer  transmission  path. 
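
One way to picture core-set selection is as finding the lowest set of mservers in the hierarchy that still spans every member of the group. The sketch below assumes that reading, and assumes each member is described by the path of mserver set names from the root set down to its building-level set; both the representation and the function name are hypothetical.

```python
from __future__ import annotations


def core_set(member_paths: list[list[str]]) -> str:
    """Return the lowest set of mservers that spans every group member.

    Each path lists the mserver sets above one member, from the root set
    downward, for example ["NPS", "Spanagel"].
    """
    if not member_paths:
        raise ValueError("a group needs at least one member")
    common = member_paths[0]
    for path in member_paths[1:]:
        shared = 0
        for a, b in zip(common, path):
            if a != b:
                break
            shared += 1
        common = common[:shared]  # keep only the prefix shared so far
    if not common:
        raise ValueError("members do not share a root set")
    return common[-1]
```

A group whose members all lie under Spanagel would be served by the Spanagel set, while a group with members under both Spanagel and Root would be served by the NPS set, one level higher and therefore with a longer relay path.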


Figure 8: Change-processing Core-set of Mservers

c. Structure

An illustration of a core-set of mservers is shown in Figure 8. The mservers
in the core-set are connected in a multicast tree, using a common multicast group to
multicast a change message from one mserver to all others at once. For each membership
change request submitted to the core-set, a coordinator is chosen. The criteria for
selecting the coordinator depend on the particular type of change and how it was
submitted to or detected by the core-set. The fact that the coordinator is not a fixed
member of the core-set, but instead varies from change to change, is a powerful feature of
the MS. Since the coordinator does not exist as such unless a change is actively being
processed, there is no need to ensure an operational coordinator exists when no change is
being processed, thus greatly reducing the core-set overhead.

Each  set  of  mservers  in  the  physical  hierarchy  is  configured  as  a  core-set. 
This  serves  the  dual  purpose  of  having  a  core-set  readily  available  for  use  by  application 
groups  at  any  level  in  the  hierarchy,  and  allowing  each  set  of  mservers  to  process 
membership  changes  among  the  mservers  of  the  core-set  locally.  Thus,  each  level  of  the 
MS  hierarchy  is  responsible  for  managing  the  mservers  at  that  level  only.  Changes  to  the 
membership  of  the  core-set  are  processed  in  exactly  the  same  manner  as  membership 
changes  submitted  by  application  groups,  with  the  exception  that  these  changes  directly 
affect  the  core-set  membership  and  are  not  propagated  outside  of  the  core-set. 
Membership changes to the core-set are generated by failure detections from within the
core-set  or  by  change  requests  sent  to  the  core-set  when  manual  configuration  of  the  MS 
physical  hierarchy  is  conducted  by  the  system  administrator. 

d.  Change-processing  Sequence 

The  basic  change-processing  sequence  uses  a  modified  form  of  the  three-way 
handshake  often  seen  in  unreliable  networks  for  reliable  message  delivery.  The 
coordinator initiates the change processing with a multicast to all core-set mservers,
collects acknowledgment (ACK) messages from all, then multicasts a final message for all
to commit the change. Timeouts and retries are used by mservers waiting to receive ACKs
or Commit messages from other mservers to ensure that continual progress is made
toward  completion  of  the  change.  As  with  the  monitoring  scheme,  if  the  correct  reply  is 
not  received  from  an  mserver  after  the  timeout  period  and  all  successive  retries,  then  that 
mserver  is  declared  failed  and  the  failure  is  announced  to  all  other  mservers  in  the  core-set. 
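
The coordinator's side of this sequence can be sketched as follows. This is a simplified model, not the MS protocol itself: the multicast is modeled as one delivery per peer, the two callables stand in for the network, and the handling of the resulting failure announcements is left out.

```python
from __future__ import annotations

from typing import Callable


def process_change(change: str,
                   peers: list[str],
                   send_and_wait_ack: Callable[[str, str], bool],
                   send_commit: Callable[[str, str], None],
                   max_retries: int = 2) -> tuple[list[str], list[str]]:
    """Run one change as coordinator; return (committed, failed) peers.

    send_and_wait_ack(peer, change) is assumed to return True if the peer's
    ACK arrived before the timeout. send_commit(peer, change) delivers the
    final Commit message.
    """
    acked: list[str] = []
    failed: list[str] = []
    # Announce the change and collect ACKs, retrying silent peers.
    for peer in peers:
        ok = any(send_and_wait_ack(peer, change)
                 for _ in range(1 + max_retries))
        (acked if ok else failed).append(peer)
    # Peers in `failed` would be announced as failed to the core-set here.
    # Final message: every peer that acknowledged commits the change.
    for peer in acked:
        send_commit(peer, change)
    return acked, failed
```

Because `any` stops at the first acknowledged attempt, a responsive peer is contacted once, while a silent peer is tried `1 + max_retries` times before being declared failed.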

e.  Multicasts  and  Failure  Detection 

The  use  of  timeouts  and  retries  on  change-processing  messages  creates  a 
secondary  but  essential  method  of  detecting  mserver  failures.  Since  mserver  monitoring 
uses  unicast  messages  and  change-processing  uses  multicasts,  it  is  possible  that  a  network 
partition  could  occur  that  affected  only  multicast  message  delivery  between  one  or  more 
mservers. The inability of mservers to communicate all necessary data creates a virtual
partition  between  the  mservers.  Without  the  use  of  this  secondary  detection  method,  it  is 
possible  that  one  or  more  mservers  could  be  functioning  perfectly  well,  sending  the 
required  monitoring  messages,  but  unable  to  respond  to  change-processing  messages,  thus 
creating a deadlock situation. The timeouts and retries on change-processing messages
ensure that an mserver unable to communicate will be detected as failed, and the remaining
mservers  will  be  able  to  complete  the  change  in  a  timely  manner.  In  the  event  of  a 
coordinator  failure  during  the  change  processing,  a  distributed  election  is  conducted  and  a 
new  coordinator  is  elected  to  continue  the  original  change. 

5.  LAN  Mserver  Monitoring 

Due  to  the  high  bandwidth,  low  latency,  hardware  multicast  capability,  and 
limited number of MIs to monitor, the mserver representing each LAN uses a simple
polling scheme to conduct status monitoring of the MIs and host computers on the LAN.
Each MI on the LAN is successively polled with a Query message by the LAN mserver.
The MI responds with a Reply message indicating normal status. Timeouts and retries are
used to detect a non-responding MI, declare that MI failed, and announce the failure. A
depiction of the LAN mserver monitoring scheme is shown in Figure 9.

Figure 9: LAN Mserver Monitoring of MIs

6.  Hierarchical  Structure 
a.  Collapsing  the  Tree 

The final organization of mservers and MIs involves forming the
monitoring-sets and core-sets of mservers into the physical hierarchical structure used by
the MS, with the MIs at the leaf level. All core-sets are also monitoring-sets, thus
providing the failure detection needed by a core-set to manage the mserver membership
locally. As shown in Figure 5, each mserver in the hierarchy has either a set of children
mservers or MIs. All mservers and MIs also have a parent mserver, except the mservers
at the highest level of the hierarchy. To create this physical structure, the logical hierarchy
of Figure 5 is "collapsed", so that each parent mserver becomes a member of the core-set
of children mservers below it, as well as a member of the core-set of peer mservers. Thus,
each mserver above the lowest level in the hierarchy has a dual membership in the
"child-set" as well as the original core-set of mservers. Figure 10 illustrates this structure.

Figure 10: "Collapsed" MS Architecture


b. Parent Mservers

A comparison of the logical MS hierarchy shown in Figure 5 with the
physical MS hierarchy shown in Figure 10 shows the same sets of mservers and MIs.
However,  the  sets  can  now  be  identified  as  change-processing  core-sets,  linked  to  the  level 
above  by  the  dual  membership  of  the  parent  mserver.  Having  the  parent  mserver  as  a 
member  of  the  child-set  has  two  primary  advantages.  First,  the  parent  mserver  is  part  of 
the  child  monitoring-set;  thus,  the  child-set  will  immediately  learn  of  the  failure  of  the 
parent  mserver  by  monitoring.  Second,  the  parent  mserver  takes  part  in  all  change 
processing  conducted  by  the  child-set;  therefore,  it  will  learn  of  any  changes  in  the 
membership  of  the  child-set  directly.  Together,  these  two  points  ensure  that  "vertical 
monitoring"  is  conducted  in  the  hierarchy.  This  provides  the  means  to  ensure  that  a  failure 
or  partition  between  levels  in  the  MS  hierarchy  will  be  detected,  allowing  the  MS  to 
reform  as  necessary. 
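
The dual membership created by collapsing the tree can be illustrated with a small sketch. The dictionary representation and the node names (nps1, spanagel1, and so on) are hypothetical and serve only to show how a parent appears in two sets.

```python
from __future__ import annotations


def collapse(children_of: dict[str, list[str]]) -> dict[str, list[str]]:
    """Build the child-set headed by each parent mserver.

    children_of maps a parent mserver to its children in the logical
    hierarchy. In the collapsed form the parent is itself a member of the
    set formed by its children.
    """
    return {parent: [parent, *kids] for parent, kids in children_of.items()}


def memberships(node: str, child_sets: dict[str, list[str]]) -> list[str]:
    """Return the parents whose child-sets contain the given node."""
    return [parent for parent, members in child_sets.items()
            if node in members]
```

With a logical tree in which nps1 has children spanagel1 and spanagel2, and spanagel1 has children mi_a and mi_b, the mserver spanagel1 ends up in two sets: the set under nps1, which it shares with its peers, and the child-set it heads. A failure of spanagel1 is therefore visible to the monitoring in both sets.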


B.  APPLICATION  GROUPS 

Support  for  application  process  groups  is  the  primary  reason  for  the  MS.  The  MS  is 
responsible  for  managing  the  membership  of  the  application  process  groups  and  providing 
services  to  the  application  process  groups.  The  following  sections  describe  how  the  MS 
accomplishes  this. 

1.  Application  Groups  and  the  Physical  Hierarchy 

a.  Scalability 

The  application  groups  consist  of  processes  running  on  host  computers 
distributed  throughout  the  networks  supported  by  the  MS.  As  shown  in  Figure  2,  the  MS 
provides the necessary services to make an application consisting of numerous distributed
processes appear as a unified application running at a single site. Because of the
scalability of the underlying MS architecture, the application process groups are
completely scalable in number and distribution of processes, with the end result being
complete transparency of the distributed nature of the MS to the service users.

b. Consistency

The  primary  service  that  the  MS  provides  application  groups  is  a  consistent 
view  of  the  group  membership  at  all  members,  as  well  as  a  consistent  ordering  of  changes 
to  the  membership  of  the  group  at  all  members.  These  consistency  guarantees  ensure  that 
a  process  group  member  either  acquires  the  same  consistent  view  as  all  other  members  of 
the group eventually, or is excluded from the membership of the group. The term
"eventually" refers to the asynchronous nature of the environment, leading to delays at
some  sites.  The  MS  allows  for  reasonable  delays,  thus  ensuring  that  all  surviving 
processes  will  receive  the  revised  group  view.  Using  this  guarantee  of  consistent 
membership  at  all  processes,  the  application  can  safely  make  certain  assumptions  about  the 
member processes. The application can expect that processes with the same group view
number have seen the same sequence of membership changes, and currently have the same
view  of  the  membership  of  the  group.  Using  this  knowledge,  the  application  can  decide  to 
accept  or  reject  messages  from  other  application  processes  depending  on  the  included 
group  view  number.  The  guarantee  of  consistent  membership  can  be  used  as  the 
foundation  upon  which  to  build  many  distributed  applications. 
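
An application-side filter based on the group view number might look like the following sketch. The `GroupView` type and the rule that the number advances once per committed change are assumptions made for illustration; the MS interface itself is not shown in this chapter.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GroupView:
    number: int                # assumed to advance once per committed change
    members: frozenset[str]    # member process names in this view


def accept(local: GroupView, sender: str, sender_view_number: int) -> bool:
    """Accept a message only from a member that holds the same view."""
    return sender_view_number == local.number and sender in local.members
```

A process holding view 3 would reject a message stamped with view 2, since the sender has not yet seen the latest membership change, and would also reject a message from a process that is not part of its current view.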

The MS provides consistent ordering of membership changes to application
groups by ensuring that only one change is ever processed at a time in the core-set of that
application group, and that all active member processes eventually receive this change.
The serialized change is committed by all core-set mservers, then reliably propagated to the
MIs, and finally, to the distributed application member processes. The MS provides the
guarantee  that  an  application  member  process  either  receives  each  revised  group  view  or  is 
detected  as  failed,  and  excluded  from  the  group.  In  this  manner,  all  surviving  application 
member  processes  are  guaranteed  to  have  exactly  the  same  ordering  of  membership 
changes. 

c.  Naming 

The MS manages the names of all application groups using the MS.
Application  group  names  are  guaranteed  unique  within  a  predetermined  scope.  When  an 
application  group  is  created,  the  software  call  from  the  application  to  the  MS  includes  as  a 
parameter  a  level  in  the  MS  physical  hierarchy,  under  which  the  application  group  name 
will  be  guaranteed  unique.  This  name-scope  parameter  is  either  the  actual  name  of  the 
core-set  or  a  level  number  above  the  MI  level  in  the  physical  hierarchy.  For  example,  to 
guarantee an application group name of "application1" as unique under the scope of the
NPS core-set from Figure 5, the name NPS or the level number 2 would be used as the
name-scope  parameter.  The  name-scope  level  must  be  at  or  above  the  core-set  level  for 
the  application. 

With  the  creation  of  each  new  application  group,  the  name-scope  parameter 
is checked at each level in the mserver hierarchy up to and including the name-scope level.
If the name already exists, the creation of the new group is refused, and an error code is
returned to the calling application. If the name is not found, then it is registered at the
name-scope level of mservers and at each level in the hierarchical tree of the application,
and a successful group creation is reported to the calling application. When new
application member processes at distributed locations wish to join an existing application
group, a join request is submitted via the resident MI, then propagated up the hierarchy
until either an mserver is located with the application name stored or the highest level in
the physical hierarchy is reached and the application name is not located. If the desired
application  group  name  is  located,  the  new  member  is  joined  into  the  application  group 
through  the  normal  change-processing  sequence,  and  a  successful  join  is  reported  back  to 
the  requesting  process.  If  the  name  is  not  located,  an  unsuccessful  join  attempt  is  reported 
back.  Through  judicious  use  of  the  name-scope  parameter,  application  names  may  be 
used  freely  with  little  concern  about  duplicate  name  usage. 
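
The create and join lookups can be sketched for a single branch of the hierarchy. The `NameRegistry` class below is a toy model, not the MS data structure: it tracks one leaf-to-root chain of mserver sets, registers a name at every level up to the name-scope level, and ignores sibling branches.

```python
from __future__ import annotations


class NameRegistry:
    """Toy name store for one branch, with levels listed leaf to root."""

    def __init__(self, levels: list[str]) -> None:
        self.levels = levels  # e.g. ["Spanagel", "NPS"]
        self.names: dict[str, set[str]] = {lvl: set() for lvl in levels}

    def create(self, group: str, name_scope: str) -> bool:
        """Register group if it is unused up to and including name_scope."""
        top = self.levels.index(name_scope)
        searched = self.levels[: top + 1]
        if any(group in self.names[lvl] for lvl in searched):
            return False  # duplicate name: creation is refused
        for lvl in searched:
            self.names[lvl].add(group)
        return True

    def join(self, group: str) -> str | None:
        """Walk upward and return the first level that knows the name."""
        for lvl in self.levels:
            if group in self.names[lvl]:
                return lvl
        return None  # reached the top without finding the name
```

In this model a second attempt to create "application1" under the NPS scope is refused, a join request for "application1" is resolved at the first level holding the name, and a join request for an unknown name travels to the top of the branch before failing.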

d.  Membership  Scope  Control 

An  additional  feature  provided  by  the  MS  is  the  ability  for  an  application  to 
decide  at  what  level  in  the  MS  physical  hierarchy  to  limit  the  scope  of  the  application 
group.  By  providing  a  membership-scope  parameter  with  the  creation  call  for  a  new 
application  group,  the  application  guarantees  that  the  span  of  the  application's 
membership  will  not  exceed  the  given  core-set  level  in  the  physical  hierarchy.  In  return, 
the  MS  is  able  to  provide  more  efficient  service  by  limiting  the  scope  of  application  group 
name  searches  to  the  membership-scope  level  and  below.  Instead  of  propagating  every 
unsuccessful  application  group  name  search  to  the  highest  level  of  the  MS  hierarchy,  the 
name  search  will  cease  at  the  membership-scope  level.  Without  use  of  the 
membership-scope,  it  might  be  possible  for  a  bottleneck  to  form  at  the  "top"  of  the  MS 
hierarchy.

2. Member Interfaces


a.  Purpose 

As previously described, the MIs provide the interface between application
group member processes and the MS physical hierarchy. They accept application
membership change and information requests from application processes and submit these
changes to the mserver hierarchy for processing. When the change or information data is
returned, the MI passes the data to the requesting member processes.

As shown in Figure 9, each MI is running on an individual host computer. Each
MI is capable of interfacing multiple application groups, each with multiple members, with
the LAN mserver and the MS. Each MI maintains a list of all application groups it is
managing as well as all member processes from these groups running on the host
computer. Thus, the membership information for each application group is maintained in a
decentralized,  scalable  manner.  When  an  application  member  process  needs  to 
communicate  with  another  application  member  process  on  a  different  host,  it  submits  a 
request  for  addressing  information  to  the  MI.  The  MI  relays  this  information  request  to 
the  MS,  which  obtains  the  desired  information  from  the  MI  managing  the  desired  member 
process,  and  relays  the  information  back  to  the  requesting  MI  and  application  member 
process. 

b. Application Member Process Monitoring

The MIs monitor the application member processes in exactly the same
manner that the LAN mserver monitors the MIs on the LAN: using polling. In the same
manner, non-responding application processes are detected as failed, the failure is
announced, then submitted to the MS for an application group membership change.

3. Application Group Change Processing

As  previously  discussed,  application  group  change  processing  begins  with  the 
submission  of  a  change  request  to  the  host  MI.  This  request  is  relayed  to  the  core-set  of 
the  application,  which  conducts  the  mserver  change-processing  procedure,  resulting  in  all 
core-set mservers committing the change. Each core-set mserver then reliably relays the
change directive down the hierarchy to the MI, and then to the requesting application
process. When the change is submitted by the MI, a timer is set to ensure a timely
response to the change. The MI waits for the returning Direct message from the LAN
mserver. If the timer expires before receiving the Direct message, a query message is sent
to the LAN mserver requesting the status of the change submitted. The LAN mserver will
respond with a Wait message if the change is still being processed, causing the MI to wait
for a period before querying the mserver again. If the MI completes all timeouts and
retries  and  still  has  not  received  a  reply  from  the  LAN  mserver,  it  detects  the  LAN 
mserver  failed  and  announces  the  failure.  In  the  same  manner,  each  intermediate  mserver 
also  sets  a  timer  for  a  response  from  the  next  higher  level  mserver.  A  non-response  leads 
to  a  partition  in  the  physical  hierarchy.  To  ensure  reliable  transmission  from  the  core-set 
to  the  application  process,  each  intermediate  mserver  and  MI  send  an  ACK  message  back 
to  the  mserver  above  upon  receipt  of  the  Direct  message.  Timeouts  and  retries  are  again 
used  to  detect  failures  and  partitions.  At  the  end  of  the  application  change  processing 
sequence,  every  application  member  process  is  guaranteed  to  have  received  the  change 
message  or  to  have  been  detected  as  failed. 
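
The MI's wait for the Direct message can be sketched as a small loop. The `poll` callable abstracts one timer period, and the choice to reset the retry count whenever a Wait message arrives is an assumption; the text does not state how Wait responses interact with the retry limit.

```python
from typing import Callable, Optional


def await_direct(poll: Callable[[], Optional[str]],
                 max_retries: int = 3) -> str:
    """Return "committed" or "lan_mserver_failed" for one submitted change.

    poll() models one timer period at the MI. It returns "Direct" if the
    change directive arrived, "Wait" if the LAN mserver answered a status
    query with a Wait message, or None if nothing was heard.
    """
    silent = 0
    while silent <= max_retries:
        answer = poll()
        if answer == "Direct":
            return "committed"
        if answer == "Wait":
            silent = 0   # assumption: the mserver is alive, so start over
        else:
            silent += 1  # no reply to this status query
    return "lan_mserver_failed"
```

Under this sketch an MI that hears nothing for the initial period and all retries declares the LAN mserver failed. An mserver that keeps answering Wait is never declared failed, so a real implementation would likely bound the total waiting time as well.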

IV. MEMBERSHIP SERVICE PROTOCOLS


The previous chapter described the component entities of the MS: the mservers and
MIs. The organization of the mservers and MIs into the MS physical hierarchy was
described in detail, as well as their basic functionality. This chapter describes in detail the
protocols used by the mservers and MIs to implement the MS, and the general format of
messages used to exchange membership information between mservers and MIs.

1.  General  Message  Types 

The  general  message  types  used  by  the  MS  and  descriptions  of  each  are  listed  in 
Table  4.  There  are  three  general  classifications  of  messages:  Monitoring,  Initiate,  and 
Change  Processing.  Many  of  these  messages  are  used  for  more  than  one  purpose,  such  as 
processing  changes  to  the  physical  hierarchy  of  mservers  and  MI  as  well  as  changes  to 
application process groups. The Monitoring messages are used by mservers in the
monitoring-set to conduct pairwise peer-to-peer monitoring, by the LAN mserver to
monitor the MIs on the LAN, and by the MI to monitor application process members.
The type of monitoring being conducted is determined by the members involved and the
context of the message used. The Initiate category of messages is used to initiate a
change for either the physical hierarchy or an application process. The Join message is
used  to  join  a  new  mserver  to  an  existing  core-set  of  mservers,  create  a  new  core-set  with 
one  mserver,  to  join  a  new  application  member  process  to  an  existing  group,  or  to  create  a 
new  application  group.  The  Leave  message  allows  a  voluntary  departure  by  an  mserver 
from  a  core-set  or  an  application  process  from  the  group.  The  Split  and  Merge  message 
types  are  the  general  form  of  Join  and  Leave,  allowing  multiple  mservers  or  application 
processes to join or leave a core-set or application group, respectively. The Add_parent
and Del_parent message types are used by an existing core-set to adopt or remove a

TABLE 4: MS GENERAL MESSAGE TYPES

Message Type      Description
--------------    --------------------------------------------------------------
Monitoring        Used by mservers and MI to determine status of others
Query             Query status of another mserver or MI
Reply             Reply to Query
Initiate          Initiate a physical or application group change
Join              Mserver join a core-set or application process join a group
Leave             Mserver leave a core-set or application process leave a group
Split             Split from core-set or group to form a new set or group
Merge             Merge separate core-sets or application groups into one
Add_parent        Core-set add an mserver as parent
Del_parent        Core-set remove the existing parent mserver
Fail              Mserver, MI, or application member process detected failed
Coord Fail        Mserver coordinator of current change detected failed
Submit            MI submit change to core-set (same types as Initiate)
Direct            Core-set change directive to MIs (same types as Initiate)
Process Change    Used to process a physical or application group change
ACK               Acknowledge Initiate or Direct messages
Wait              Wait to begin processing change or for next message in change
Commit            Commit the current change
Msg Query         Query mserver for status of next message expected in change
Init              Initial parameters message from coordinator to joining mserver

parent  mserver.  This  action  is  the  primary  function  used  to  create  the  hierarchy  of 
core-sets.  The  Fail  message  type  is  used  to  announce  the  failure  of  an  mserver  or 
application  member  process  and  initiate  the  change  to  remove  the  failed  member  from  the 
core-set  or  application  group.  The  Coord  Fail  message  is  used  to  announce  the  failure  of 
the  coordinator  mserver  for  the  current  change  being  processed.  This  message  will 
prompt the election of a new coordinator, which will complete the original change. The
third category of general message types consists of those actually used to process membership
changes to a core-set of mservers or an application group. The ACK message is a general
acknowledgment message used to indicate successful reception of an Initiate or Direct
message. The Wait message is used by the coordinator of the current change or by an
mserver propagating a Submit message to inform querying mservers that there is a delay
in completing the current change and that they should wait for a period for the next
expected message. It is also used by a core-set mserver to inform another mserver
attempting to initiate a new change that the current change is not yet completed, and the
new coordinator should wait a period before beginning the next change. The Commit
message is sent by the coordinator to inform all core-set mservers that it is safe to commit
the current change as the new group view and to propagate this view to application
processes as needed. The Msg Query message is used by an mserver or MI to query
another mserver about the status of the message for the current change expected from the
queried mserver. The mserver receiving the Msg Query will usually respond with a Wait
message or, if it is determined that the message was lost, with the expected message. The
Submit message is used by an MI submitting an application group change to the LAN
mserver, then by each mserver in the hierarchy to propagate the change request to the
application core-set. The Submit is in effect a remote Initiate message, and has the same
category of types as an Initiate message. The Direct message is used by the core-set
mservers with application members at their leaf level to propagate the committed
application change down the hierarchy to the MIs representing the application group, and
also has the same category of message types as an Initiate message. Figure 11 illustrates
the messages sent to and received by an mserver, while Figure 12 illustrates the messages
sent to and received by an MI.

Figure 11: Mserver Messages

Figure 12: Member Interface (MI) Messages
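
The message types of Table 4 lend themselves to a small enumeration. The following Python sketch is illustrative only: the names MsgType, INITIATE_TYPES, and wrap_change are assumptions introduced here, not part of the MS specification. It encodes the one structural rule stated above, namely that a Submit or Direct message carries one of the same types as an Initiate message.

```python
from enum import Enum


class MsgType(Enum):
    """General MS message types, in the row order of Table 4."""
    QUERY = "Query"
    REPLY = "Reply"
    JOIN = "Join"
    LEAVE = "Leave"
    SPLIT = "Split"
    MERGE = "Merge"
    ADD_PARENT = "Add_parent"
    DEL_PARENT = "Del_parent"
    FAIL = "Fail"
    COORD_FAIL = "Coord Fail"
    SUBMIT = "Submit"
    DIRECT = "Direct"
    ACK = "ACK"
    WAIT = "Wait"
    COMMIT = "Commit"
    MSG_QUERY = "Msg Query"
    INIT = "Init"


# Types an Initiate message may carry; Submit and Direct reuse the same set.
INITIATE_TYPES = frozenset({
    MsgType.JOIN, MsgType.LEAVE, MsgType.SPLIT, MsgType.MERGE,
    MsgType.ADD_PARENT, MsgType.DEL_PARENT,
    MsgType.FAIL, MsgType.COORD_FAIL,
})


def wrap_change(carrier: MsgType, change: MsgType) -> tuple:
    """Pair a Submit or Direct carrier with the Initiate type it conveys.

    Hypothetical helper: the thesis does not define such a function.
    """
    if carrier not in (MsgType.SUBMIT, MsgType.DIRECT):
        raise ValueError("carrier must be Submit or Direct")
    if change not in INITIATE_TYPES:
        raise ValueError(f"{change.value} is not an Initiate type")
    return (carrier, change)
```

For example, wrap_change(MsgType.SUBMIT, MsgType.JOIN) models an MI forwarding a Join request up the hierarchy, while pairing Submit with ACK is rejected.
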
Figure 13: Membership Service General Message Format


2.  General  Message  Format 

The  general  message  format  used  by  the  MS  is  shown  in  Figure  13.  An 
explanation of the meaning of each message field is provided in Table 5. The exclude and
subject  lists  shown  in  Figure  13  are  queues  maintained  by  each  mserver,  which  are 
included  with  certain  types  of  messages.  Each  element  of  these  lists  contains  the  minimal 
amount of information to uniquely identify an mserver or application process, when

TABLE 5: MS GENERAL MESSAGE FIELDS

Field             Description
--------------    --------------------------------------------------------------
vers              MS version number
checksum          used for message error detection
group_name        core-set or application group name
authentication    used for group security
group_view        current core-set or application group view number
msg_type          message type
sender_gid        message sender's group identification (gid) number
subject_gid       message subject's group identification (gid) number
subject_rank      seniority-based rank of mserver or application member process
exclude_list      list of mservers to be excluded from core-set due to failure
excl_list_len     number of mservers in exclude_list
subject_list      list of subjects for a Merge or Split, or failed mservers or members
subj_list_len     number of mservers or members in subject_list
data              general purpose data field
data_len          length of data included with message

combined with the information about the core-set or application group contained in the
message. The lists are used to communicate information about sets of mservers or
application processes. The exclude list serves a dual purpose: to ensure that failed
mservers are not included in the communications of the core-set, and to inform other
core-set mservers of mservers perceived failed before they are actually removed from the
core-set. Because the network multicast capability assumed by the MS is unable to
dynamically tailor the receivership of each multicast message, a filter mechanism must be
used to ensure that unintended mservers do not receive the current message. An mserver
which is detected as failed is added to the exclude list of the current message sent by the
detecting mserver. Mservers receiving this message will cease all communications with
the excluded mserver. If an mserver is still functioning and receives this message with
itself listed in the exclude list, it will immediately cease all communications with all
mservers in the core-set, with the possible exception of other mservers in the exclude list,
with which it will attempt to reform a new core-set. This method of "piggy-backing" the
detected failure of mservers with another message is referred to as "gossip" [9].
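
The message fields and the exclude-list rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the field types, the class name MSMessage, and the helper allowed_peers are hypothetical, and the length fields of the general message format are derived from the lists rather than stored.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class MSMessage:
    """General MS message; field types are assumed, not given in the thesis."""
    vers: int
    checksum: int
    group_name: str
    authentication: bytes
    group_view: int
    msg_type: str
    sender_gid: int
    subject_gid: int
    subject_rank: int
    exclude_list: List[int] = field(default_factory=list)
    subject_list: List[int] = field(default_factory=list)
    data: bytes = b""

    @property
    def excl_list_len(self) -> int:
        return len(self.exclude_list)

    @property
    def subj_list_len(self) -> int:
        return len(self.subject_list)

    @property
    def data_len(self) -> int:
        return len(self.data)


def allowed_peers(own_gid: int, core_set: Set[int], msg: MSMessage) -> Set[int]:
    """Return the gids this mserver may keep communicating with.

    A receiver that is not excluded drops every excluded mserver. A receiver
    that finds itself excluded drops the whole core-set except the other
    excluded mservers, with which it may try to form a new core-set.
    """
    excluded = set(msg.exclude_list)
    others = core_set - {own_gid}
    if own_gid in excluded:
        return others & excluded
    return others - excluded
```

With a core-set {1, 2, 3, 4, 5} and an exclude list [4, 5], mserver 1 keeps talking to {2, 3}, while mserver 4 is left with {5} only.
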

3. General State of Mservers and MIs

The MS maintains information about the physical hierarchy and application
groups in a decentralized manner. Individual mservers and MIs need only maintain the
information  necessary  to  perform  their  required  functions.  This  decentralized  storage  of 
MS  information  is  essential  to  the  scalability  of  the  MS. 

a.  Mservers 

For an mserver, the information stored about the physical hierarchy includes
the gids, ranks, and addresses of other mservers in the core-set and child-set of mservers
of which it is a part; the monitor and monitored mservers for each of these sets of
mservers; the parent mserver of the core-set; information about the current and previous
changes processed; and queues of mservers detected failed, received change requests, and
excluded mservers. Each mserver also maintains information about the application groups
that it supports. This information includes a list of application groups supported by the
mserver; which children mservers are in the application group's hierarchy; and whether the
mserver level is the core-set, membership-scope, or name-scope level of the application.

b. MIs

Each MI must store information about the application groups and their
members that it is supporting on the host computer, as well as the address of the parent
mserver. Any other required information is obtained through the MS hierarchy.

4. Physical and Application Group Protocols

To perform the various functions described, mservers and MIs use five primary
protocols. These protocols are: the physical monitoring protocol, the application group
monitoring protocol, the physical core-set change-processing protocol, the application
group change-processing protocol, and the network partition resolution protocol. Each of
these protocols is described in detail in the following sections, using pseudo-code
algorithm listings and event diagram illustrations.

A.  PHYSICAL  MONITORING  PROTOCOL 

1.  Pairwise  Monitoring 

As  described  in  the  previous  chapter,  mservers  in  the  physical  hierarchy  are 
arranged  into  monitoring-sets  for  the  purpose  of  detecting  mserver  failures.  Within  these 
monitoring-sets,  the  mservers  conduct  pairwise,  peer-to-peer  monitoring.  The  monitoring 
mserver  periodically  sends  a  Query  message  to  the  monitored  mserver,  which  responds 
with  a  Reply  message  indicating  normal  operation.  The  algorithm  for  this  physical 
monitoring  protocol  is  shown  in  Figure  14,  with  the  description  following. 

Monitoring mserver

/* when monitoring timer has expired */

1. form_message (Query, current_change, exclude_list)
2. send_message (Query_message, monitored mserver)
3. messg = Reliable_receive (Reply_message, Query_message)
4. if (messg != Reply_message) /* failed mserver */
5.     declare the monitored mserver failed
6. else
7.     reset monitoring timer

Figure 14. Physical Monitoring Protocol

Figure 14 shows the procedure used by the monitoring mserver. In lines 1 and 2
the Query message is sent to the monitored mserver. Line 3 uses a function called
Reliable_receive, explained in the next section, which uses timeouts and retries to ensure
the Reply message is received in a reasonable period of time. Lines 4 and 5 declare the
monitored mserver failed if it did not respond to the Query. Finally, line 7 resets the
monitoring timer to repeat the process after a suitable period.

2.  Failure  Detection,  Timeouts,  and  Retries 

The  primary  means  of  detecting  an  mserver  failure  through  monitoring  is  the  use 
of  timeouts  and  retransmissions  of  the  Query  message.  If  after  a  preset  number  of 
timeouts  and  retries  the  monitored  mserver  still  has  not  responded,  it  is  assumed  to  have 
failed.  The  Reliable_receive  function  in  Line  3  of  Figure  14  performs  this  timeout  and 
retry  sequence.  The  function  is  termed  "reliable"  because  it  ensures  a  reliable 
communication  over  a  single  link;  that  is,  the  monitored  mserver  either  responds  in  a 
reasonable  period  or  is  determined  to  have  failed.  The  algorithm  for  this  function  is  listed 
in  Figure  15. 


Reliable_receive (expected_messg, query_messg)

/* returns the received message */

1. set_timer (timeout)
2. retries = n
3. while ((timer not expired) and (messg != expected_messg))
4.     receive_message (messg)
5. if ((messg != expected_messg) and (retries > 0))
6.     retries = retries - 1
7.     send_message (query_messg, destination)
8.     set_timer (timeout)
9.     goto 3.
10. return messg


Figure 15: Reliable_receive

Lines 1 and 2 initialize the function. Line 3 begins the main reception loop. Line
4 is a timed receive function, which returns the received message immediately upon
reception or times out waiting and returns a NULL message. The main loop is executed
until the expected message is received or the main timeout period expires. Lines 5
through 9 perform the retry sequence. If the expected message is received, the function
returns immediately; otherwise, the function times out and returns whatever message, if
any, was received. By examining the returned message, the monitor is able to decide if the
monitored mserver has failed.
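
The timeout-and-retry behaviour of Reliable_receive can be sketched in Python as shown below. This is an illustrative reading of Figure 15, not the thesis implementation: the transport is abstracted into two caller-supplied callables, send and receive, and the clock is injectable so the routine can be exercised without real delays. All parameter names are assumptions.

```python
import time
from typing import Any, Callable, Optional


def reliable_receive(
    expected: Any,
    query: Any,
    send: Callable[[Any], None],
    receive: Callable[[float], Optional[Any]],
    retries: int = 3,
    timeout: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[Any]:
    """Wait for `expected`, re-sending `query` after each timeout.

    `receive(remaining)` is assumed to block for at most `remaining` seconds
    and to return None if nothing arrived. The function returns the expected
    message, or whatever was last received once the retries are exhausted.
    """
    messg = None
    while True:
        deadline = clock() + timeout              # figure lines 1 and 8
        while messg != expected:                  # figure line 3
            remaining = deadline - clock()
            if remaining <= 0:
                break
            messg = receive(remaining)            # figure line 4
        if messg == expected or retries == 0:     # figure lines 5 and 10
            return messg
        retries -= 1                              # figure line 6
        send(query)                               # figure line 7
```

A monitoring mserver would then declare its peer failed whenever the returned value differs from the Reply message it was waiting for.
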

B.  APPLICATION  GROUP  MONITORING 

The  protocol  used  by  an  MI  to  monitor  the  status  of  the  application  member 
processes  that  it  is  interfacing  is  exactly  the  same  as  that  used  by  the  LAN  mserver  to 
monitor the MIs running on the host computers of the LAN. The MI periodically queries
each application member process, using the Reliable_receive function, and declares any
application processes not responding as failed. Figure 9 shows the arrangement of LAN
mserver, MIs, and application member processes.

C.  PHYSICAL  CORE-SET  CHANGE  PROCESSING 

At  the  heart  of  the  MS  is  the  ability  of  a  small  set  of  mservers  to  make  a  consistent, 
mutually  agreed  upon  decision  about  membership  changes  to  physical  sets  of  mservers  and 
application  groups  of  any  size.  This  section  describes  the  types  of  changes  processed,  the 
basic change-processing protocol used by a core-set to commit these changes, and
additional  protocols  used  in  the  event  of  failures  of  mservers  within  the  core-set. 

1.  Coordinator 

The  coordinator  is  the  core-set  mserver  responsible  for  coordinating  the 
processing  of  the  current  membership  change.  One  of  the  strengths  of  the  MS  is  that  any 
mserver can become the coordinator, either initially upon detecting or receiving a change,
or following the failure of the current coordinator. Also, the coordinator only exists in
that capacity while the current change is being processed; when there are no membership
changes to process in a core-set, there is no coordinator. The coordinator for each change
is  determined  by  a  combination  of  the  type  of  change,  which  core-set  mservers  detect  or 
receive  the  change,  and  a  priority  associated  with  each  change.  In  addition  to  determining 
which  mserver  will  act  as  coordinator,  these  criteria  also  ensure  that  only  one  change  at  a 
time  is  committed  by  a  core-set. 

2.  Types  of  Changes 

There  are  three  primary  types  of  membership  changes  processed  by  a  core-set  of 
mservers:  requests,  failures,  and  dynamic  reconfigurations. 

a.  Requests 

Requests  are  voluntary,  planned  membership  changes,  submitted  to  the 
core-set  for  processing  by  an  application  process,  membership  service  user,  or  system 
administrator.  Change  requests  for  the  MS  physical  hierarchy  may  be  of  any  type  listed  in 
Table  4,  with  the  exception  of  Fail  and  Coord  Fail.  Application  group  change  requests 
may be of any type used for physical change requests except Add_parent and Del_parent.
Physical  change  requests  are  multicast  to  a  specific  core-set  in  the  hierarchy  by  a  system 
configuration  call,  usually  invoked  by  a  system  administrator  during  manual  configuration 
of  the  MS  hierarchy.  The  physical  change  request  is  received  by  all  mservers  in  the 
selected  core-set.  Each  receiving  mserver  queues  the  request,  to  be  processed  when  other 
higher  priority  changes  have  completed  processing.  Application  group  change  requests 
are  submitted  to  the  resident  MI  process  on  the  host  computer  by  the  application  or  the 
MSU.  The  MI  then  propagates  the  request  to  the  core-set  mserver  above  it  in  the 
hierarchy.  The  receiving  core-set  mserver  queues  the  request  to  be  processed  when  other 
higher priority changes have completed processing.

b. Failures

The  second  primary  type  of  membership  changes  are  detected  failures. 
These  detected  failures  may  be  the  result  of  the  actual  failure  of  an  mserver,  MI,  or 
application  process,  or  the  host  machines  upon  which  they  are  running.  Additionally, 
network  partitions  will  be  perceived  as  failures  of  the  partitioned  mservers,  and  will  lead  to 
the  processing  of  failures  and  reformation  of  the  partitioned  subsets  of  mservers  and 
subgroups  of  application  processes.  The  partitioning  of  the  MS  physical  hierarchy  leads  to 
a  partitioning  of  the  application  groups  residing  on  this  hierarchy.  The  MS  automatically 
reforms  both  the  physical  hierarchy  and  the  supported  application  groups  in  the  event  of  a 
network  partition.  Failures  detected  or  received  by  a  core-set  mserver  are  queued  and 
processed  according  to  their  priority.  Multiple  failures  queued  at  a  core-set  mserver  are 
processed  all  at  once,  in  a  "batched"  manner.  This  greatly  reduces  the  time  required  to 
reform  physical  core-sets  or  application  groups. 

c. Dynamic Reconfigurations

The  final  type  of  changes  are  the  result  of  automatic  actions  taken  by 
core-sets  of  mservers.  As  part  of  the  processing  of  multiple  failures  caused  by  a  network 
partition,  a  core-set  is  often  partitioned  into  two  or  more  subsets.  After  the  reformation 
into  subsets  has  occurred,  these  sub-core-sets  attempt  to  reform  into  the  original  core-set 
by  sending  messages  to  the  other  subsets  of  mservers.  Since  the  sub-core-sets  still  share 
the  same  multicast  address,  once  the  network  partition  is  mended,  the  other  sub-core-sets 
receive these reformation messages. Upon learning of the existence of a sub-core-set from
the original core-set, the partitioned subsets of mservers reform into the original core-set
automatically. In addition to reforming the physical core-set, all application groups which
were partitioned and are still functioning are also reformed. The reformation process for
both physical core-sets and application groups merges the currently existing membership
of each, taking the union of all subsets or subgroups, and making the reformed core-set or
application group membership the current view. In the event that the network partition is
not repaired in a predetermined period of time, the partitioned subsets of mservers will
abandon their attempts to reform the original core-set, and will create a new multicast
group with only the current core-set mservers included.
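
The reformation rule above reduces to a set union with a give-up timer. The Python sketch below illustrates that decision only; the function name reform, its parameters, and the three action labels are hypothetical and stand in for the message exchange performed by the MS.

```python
from typing import Iterable, Set, Tuple


def reform(
    own_view: Set[int],
    heard_views: Iterable[Set[int]],
    elapsed: float,
    give_up_after: float,
) -> Tuple[Set[int], str]:
    """Decide what a partitioned sub-core-set does next.

    own_view      gids in this sub-core-set
    heard_views   views of other sub-core-sets whose reformation
                  messages have been received (empty while partitioned)
    elapsed       time spent attempting to reform
    give_up_after the predetermined period after which attempts stop
    """
    heard = [set(v) for v in heard_views]
    if heard:
        # Partition mended: the new view is the union of all subsets.
        return set(own_view).union(*heard), "reformed"
    if elapsed >= give_up_after:
        # Abandon the old core-set and continue with the current members.
        return set(own_view), "new_multicast_group"
    return set(own_view), "keep_trying"
```
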

Another type of dynamic reconfiguration occurs when new members join an
application group, causing the span of the application group to increase beyond that
presently covered by the current application core-set. In this event, the application
core-set must be moved from the present level in the physical hierarchy to a higher level
covering the new span of the application. This new level must be at or below the
name-scope and membership-scope levels of the application group, if these levels were
designated when the application group was created. The MS automatically moves the
application core-set to the new level. In a similar manner, the departure of application
member  processes  may  lead  to  a  reduced  span  of  the  application.  An  application  core-set 
must  have  at  least  two  mservers  with  application  members  in  their  subtrees;  otherwise, 
there  is  no  need  to  have  the  application  core-set  at  this  level  in  the  hierarchy.  If  the 
application  core-set  is  reduced  to  only  one  mserver  supporting  an  application,  the 
application  core-set  will  automatically  move  down  to  the  child-set  of  this  mserver. 

The  repositioning  of  an  application  core-set  is  initiated  by  the  set  of  mservers 
detecting  the  need  to  move  the  application  core-set.  Messages  are  exchanged  between  the 
old  and  new  core-sets  and  a  change  involving  the  join  or  departure  of  the  instigating 
application  member  is  processed  along  with  the  change  in  application  core-set  level  by 
both  core-sets.  After  committing  the  changes,  the  internal  state  of  all  mservers  in  both 
core-sets  is  changed  to  reflect  the  new  application  core-set  level. 

3.  Ordering  and  Priority  of  Change  Processing 

A  key  issue  associated  with  processing  membership  changes  to  sets  of  mservers 
or  application  groups  is  the  ordering  of  changes  committed  by  the  core-set.  As  previously 
described, to guarantee consistent ordering of membership changes at all mservers in the
core-set, only one change may be committed at a time. However, it is possible that more
than one membership change may be submitted to or detected by the core-set at one time.
Each receiving or detecting mserver in the core-set will attempt to become the core-set
coordinator and initiate the change it received or detected. These multiple change
initiation attempts are referred to as "virtually simultaneous", since they have all been
initiated before the core-set has reached a consistent and uniform decision on the current
change to process. If the core-set had already chosen a current change and coordinator, a
newly initiated change would be processed after completion of the current change.

To  resolve  these  virtually  simultaneous  changes  and  select  only  one  change  to  be 
processed,  a  prioritization  scheme  is  used.  This  prioritization  scheme  uses  the  type  of 
change  and  the  gid  and  rank  of  the  subject  of  the  change  to  decide  which  change  will  be 
processed  by  the  core-set.  The  highest  priority  is  given  to  any  current  change  being 
processed  by  the  core-set;  that  is,  a  change  which  has  been  consistently  accepted  by  all 
core-set  mservers.  It  is  essential  that  such  a  change  progresses  to  completion  at  all 
core-set  mservers;  otherwise,  the  possibility  of  inconsistent  membership  views  exists  if 
some mservers commit the change while others do not. The next highest priority rule is that
physical changes always have priority over application group changes. This is because it is
important to ensure a complete and whole MS before attempting to change the
membership of an application group using the MS. Once these decisions have been made,
the priority of the change is determined by the rank of the subject of the change. The only
exceptions to this rule are for the failure of the coordinator of the current change and for
a Join. The failure of the coordinator has priority over otherwise equal status changes. A
newly joining mserver or member will not have an associated rank until after the join is
completed. For this reason, the network address of the joining member is used as a rank
number to give a priority among Joins. The final rule used to determine the priority of
virtually simultaneous changes is applicable when changes are submitted to the core-set by
different application groups with identical subject rankings in each group. In this case, a
tie-breaker is needed, and the rank of the receiving mserver is used to decide which
change will be processed.
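
The prioritization rules can be expressed as a lexicographic sort key. The Python sketch below is one possible reading and rests on two assumptions that the text does not settle: that a numerically lower rank means higher priority, and that a coordinator failure is compared before subject rank. The names Change, priority_key, and select_change are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Change:
    kind: str              # e.g. "Join", "Leave", "Fail", "Coord Fail"
    physical: bool         # True for a physical hierarchy change
    subject_rank: int      # for a Join, the network address used as a rank
    receiver_rank: int     # rank of the mserver that received the change
    in_progress: bool = False  # already accepted by all core-set mservers


def priority_key(c: Change) -> Tuple[int, int, int, int, int]:
    """Smaller tuples win; each element encodes one rule from the text."""
    return (
        0 if c.in_progress else 1,           # current change comes first
        0 if c.physical else 1,              # physical before application
        0 if c.kind == "Coord Fail" else 1,  # coordinator failure exception
        c.subject_rank,                      # assumed: lower rank wins
        c.receiver_rank,                     # final tie-breaker
    )


def select_change(candidates: Iterable[Change]) -> Change:
    """Pick the single change the core-set should process next."""
    return min(candidates, key=priority_key)
```

Because every mserver evaluates the same key over the same candidates, each one reaches the same choice without further message exchange, which is the property the scheme is meant to provide.
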

4.  The  Basic  Change-processing  Protocol 

As discussed in the description of an mserver core-set, the base
change-processing protocol used by a core-set is a modified version of the three-way
handshake used in unreliable networks to ensure reliable message delivery. An event
diagram illustrating the sequence of message transmissions and receptions is shown in
Figure 16. The algorithm for the coordinator of the base change is listed in Figure 17.
The algorithm for the non-coordinator core-set mservers is listed in Figure 18.


[Event diagram: Phase I (Initiate) and Phase II (Commit)]

Figure 16. Basic Two-Phase Change-Processing Protocol


The basic change-processing protocol consists of two phases: the Initiate and
Commit  phases.  In  the  Initiate  phase,  the  coordinator  multicasts  an  Initiate  message  to  all 
mservers  in  the  core-set.  The  core-set  mservers  respond  with  ACKs,  acknowledging 
reception  of  the  Initiate  message.  When  all  ACKs  have  been  received  by  the  coordinator, 
the  Commit  phase  is  begun  with  the  coordinator  multicasting  a  Commit  message  to  all 
core-set  mservers,  indicating  that  it  is  safe  to  commit  the  change.  All  change-processing 
messages  use  timeouts  and  retries  to  ensure  continual  progress  in  completing  the  change. 
The procedures Reliable_receive and Reliable_multi_receive perform these functions.

Coordinator Basic Change

/* coordinator has been identified by reception of a change request or detection of a failed
mserver */

/* Initiate phase */

1. current_change = change data (received or detected)
2. update (exclude_list, internal state)
3. form_message (Initiate, current_change, exclude_list)
4. multicast (Initiate_message, {core_set - exclude_list - coordinator})
5. Reliable_multi_receive (ACK_message, Initiate_message, {core_set - exclude_list -
   coordinator})

/* Commit phase */

6. update (exclude_list, internal state, current_change)
7. form_message (Commit, current_change, exclude_list)
8. multicast (Commit_message, {core_set - exclude_list - coordinator})
9. update (internal state)
10. previous_change = current_change
11. group_view = group_view + 1


Figure 17. Coordinator Basic Change Protocol


Non-coordinator Basic Change

/* core-set mserver has received and decoded Initiate message */

/* Initiate phase */

1. current_change = change data (received)
2. update (exclude_list, internal state)
3. form_message (ACK, current_change, exclude_list)
4. send_message (ACK_message, coordinator)
5. messg = Reliable_receive (Commit_message, Msg_Query_message)
6. if (messg != Commit_message) /* failed coordinator */
7.     goto Broadcast Election Protocol

/* Commit phase */

8. update (internal state)
9. previous_change = current_change
10. group_view = group_view + 1


Figure 18: Non-Coordinator Basic Change Protocol
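
The two algorithms can be exercised together in a small in-memory simulation. The Python sketch below is a simplification made for illustration: message delivery is a direct method call, the timeout and retry machinery is collapsed into a `reachable` set of gids that answer the Initiate, and coordinator failure and the election protocol are not modelled. The names Mserver and run_basic_change are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Mserver:
    gid: int
    group_view: int = 0
    current_change: Optional[str] = None
    previous_change: Optional[str] = None

    def on_initiate(self, change: str) -> str:
        """Non-coordinator Initiate phase: record the change, reply ACK."""
        self.current_change = change
        return "ACK"

    def on_commit(self) -> None:
        """Commit phase: install the change as the new group view."""
        self.previous_change = self.current_change
        self.group_view += 1


def run_basic_change(
    coordinator: Mserver,
    core_set: List[Mserver],
    change: str,
    reachable: Set[int],
    exclude_list: Iterable[int] = (),
) -> Set[int]:
    """Run one Initiate/Commit round and return the resulting exclude list."""
    exclude = set(exclude_list)
    targets = [m for m in core_set
               if m.gid != coordinator.gid and m.gid not in exclude]

    # Initiate phase: multicast Initiate and collect ACKs.
    coordinator.current_change = change
    acked = [m for m in targets
             if m.gid in reachable and m.on_initiate(change) == "ACK"]

    # Mservers that never acknowledged are treated as failed and excluded.
    exclude |= {m.gid for m in targets} - {m.gid for m in acked}

    # Commit phase: multicast Commit to the mservers that acknowledged.
    for m in acked:
        m.on_commit()
    coordinator.on_commit()
    return exclude
```

In a four-member core-set where mserver 4 is unreachable, the round ends with mservers 1 to 3 at group view 1 and mserver 4 on the exclude list.
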
a. Timeouts and Retries For All Messages

The need for timeouts and retries of messages has been discussed previously.
The function Reliable_receive accomplishes this for unicast messages, as described in the
monitoring protocol section. The same function is performed by the procedure
Reliable_multi_receive when multiple responses must be received, as shown in Figure 19.
The algorithm for Reliable_multi_receive is listed in Figure 20.


[Event diagram: successive timeout intervals]

Figure 19: Message Timeout, Retries, and Failure Detection


Reliable_multi_receive (expected_messg, last_messg, responders)

/* Last_messg has been sent, now collect expected_messg responses. Modifies the set of
responders to reflect those not responding */

1. set_timer (timeout)
2. retries = n
3. initialize (all responders = not_responded)
4. num_responders = number of responders
5. responses = 0
6. while ((timer not expired) and (responses < num_responders))
7.     receive_message (messg)
8.     if ((messg = expected_messg) and (responder = not_responded))
9.         responder = responded
10.        responses = responses + 1
11. responders = {responders - (all responding responders)}
12. if ((responses < num_responders) and (retries > 0))
13.     retries = retries - 1
14.     multicast (last_messg, responders)
15.     set_timer (timeout)
16.     goto 6.

Figure  20:  Reliable_multi_receive  Algorithm 
Lines 1 and 2 of the Reliable_multi_receive algorithm initialize the timer and
number of retries. Lines 3 through 5 initialize the set of responders and number of
responses. Line 6 begins the main loop to collect responses from the set of expected
responding mservers. Line 7 is the timed receive function described in the
Reliable_receive function. Lines 8 through 10 determine if the response is valid and not a
duplicate, and if so, mark the responding mserver as having responded. Line 11 calculates
the new set of expected responding mservers after the loop has completed. If any
mservers have not responded and the retries have not been exhausted, lines 12 through 16
initialize for another timed reception loop to attempt to collect the remaining responses.
At the end of the procedure, the set of responders has been reduced to only those who
failed to respond. These mservers will be declared failed by the calling mserver; in this
case, by the coordinator in line 5 of the basic change protocol.
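
A minimal Python sketch of the collection loop just described may help make the retry
behavior concrete. The receive and multicast callbacks are hypothetical stand-ins for the
transport: receive() is assumed to return a (sender, kind) pair, or None when the timed
receive expires.

```python
def reliable_multi_receive(expected, last_msg, responders,
                           receive, multicast, retries=2):
    """Collect one `expected` reply from each responder.

    receive() returns (sender, kind), or None once the timer expires;
    multicast(msg, targets) resends `last_msg`. Both are assumed interfaces.
    Returns the set of responders that never answered.
    """
    pending = set(responders)
    while True:
        # Lines 6-10: one timed reception loop.
        while pending:
            reply = receive()
            if reply is None:              # timer expired
                break
            sender, kind = reply
            if kind == expected and sender in pending:
                pending.discard(sender)    # valid and not a duplicate
        # Lines 12-16: retry only toward those still silent.
        if not pending or retries == 0:
            return pending
        retries -= 1
        multicast(last_msg, frozenset(pending))


# Scripted example: m2 answers twice, m3 answers after one resend, m4 never does.
script = iter([("m2", "ACK"), ("m2", "ACK"), None, ("m3", "ACK"), None, None])
resent = []
failed = reliable_multi_receive(
    "ACK", "Initiate(C1)", {"m2", "m3", "m4"},
    receive=lambda: next(script),
    multicast=lambda msg, targets: resent.append((msg, targets)),
    retries=2,
)
```

In this run the message is resent twice, first to m3 and m4 and then to m4 alone, and the
caller is left with m4 as the only mserver to declare failed.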

Figure 21 is an event diagram showing the sequence of messages in the
event of a lost or delayed ACK message from a non-coordinator core-set mserver. After a
timeout period, the coordinator resends the Initiate message, and receives the ACK
message. The core-set then commits the change. In this event diagram, the multicast of a
message is indicated by multiple arrows emanating from a small circle.

[Event diagram. Timelines: m2, m3, m4. Annotations: m4 does not receive Initiate (C1);
m4 sends ACK (C1); all commit C1.]

Figure 21: Lost or Delayed ACK Message During Initiate Phase

Figure 22 shows a similar situation, in which a non-coordinator mserver has
not received an expected Commit message from the coordinator. This mserver sends a
Msg_Query message to the coordinator, querying the coordinator about the missing
Commit message. The coordinator realizes that the querying mserver must not have
received the original Commit message, so it resends the message. The core-set then
completes the change.


[Event diagram. Timelines: m1, m2, m3, m4.]

Figure 22: Lost or Delayed Commit Message


Figure 23 shows the sequence of events when a non-coordinator mserver is
unable to receive from the coordinator. m2 is unable to receive the Initiate message from
coordinator m1, and after timeouts and retries, the coordinator declares m2 failed. While
the coordinator was waiting to receive an ACK message from m2, it received a
Msg_Query message from m3 and m4, querying the coordinator about the expected
Commit message. The coordinator responds with a Wait message, telling m3 and m4 that
the coordinator is still collecting ACKs, and will send the Commit message when done.
The use of the Msg_Query and Wait message is described in the next section. After the
coordinator has detected m2 failed, it sends the Commit message with gossip about m2's
failure to m3 and m4, completing the change and informing them that m2 has failed.


[Event diagram. Timelines: m1, m2, m3, m4.]

Figure 23: Lost or Delayed ACK Message During Initiate Phase
b. Virtually Simultaneous Changes

The basic change protocol, listed in Figures 17 and 18 for the coordinator
and non-coordinator, respectively, is unable to resolve the virtually simultaneous changes
discussed previously. The Reliable_receive and Reliable_multi_receive functions used by
the basic change protocol are only capable of receiving the expected message or declaring
the non-responding mserver or mservers as failed. They are not able to handle unexpected
messages, including an Initiate message from another mserver attempting to initiate a
change. To allow for the occurrence of simultaneous changes and other unexpected
messages, the Reliable_receive and Reliable_multi_receive functions were augmented to
cover all possible unexpected messages. These augmented versions are listed in Figures
24 and 25.

The augmented Reliable_receive function has the same name and is called
with the same parameters as the simpler version in Figure 15. The new version is used in
place of the simpler version in line 5 of the non-coordinator basic change protocol. The
modified or added lines to the algorithm are underlined in Figure 24. The new version is
used by a non-coordinator core-set mserver waiting for a Commit message from the
coordinator, by an MI or non-core-set mserver submitting an application change and
waiting for a Direct message, and by the monitor mserver waiting for a Reply message. A
detailed description of the augmented Reliable_receive function follows.

Reliable_receive (expected_messg, query_messg)

/* Augmented Reliable_receive used by mserver to reliably receive Commit from
coordinator or Direct from parent mserver. Handles relevant unexpected messages.
Returns the received message */

1. set_timer (timeout)
2. retries = n
3. while ((timer not expired) and (messg != expected_messg))
4.     receive_message (messg)
5.     if (messg == Initiate_message or Coord_Fail_message)
6.         if (messg is from next change) /* overlap from next view */
7.             form_message (Wait, current_change, exclude_list)
8.             send_message (Wait_message, responder)
9.     if (messg == Coord_Fail_message of higher priority or current coordinator)
10.        goto Broadcast Election Protocol
11.    if (messg == Initiate_message of higher priority or current coordinator)
12.        goto Non-Coordinator Basic Change Protocol
13.    if (messg == Wait_message) /* response to Msg_Query */
14.        wait (wait_timeout)
15.        goto 1. /* restart Reliable_receive for expected_messg */
16.    if (messg == Msg_Query_message) /* other mserver querying status */
17.        if (current change) /* other mserver must wait for next message */
18.            form_message (Wait, current_change, exclude_list)
19.            send_message (Wait_message, responder)
20.        if ((previous unfinished change) and (mygid == previous coordinator))
21.            form_message (Commit, current_change, exclude_list)
22.            send_message (Commit_message, responder)
23. if ((messg != expected_messg) and (retries > 0))
24.     retries = retries - 1
25.     send_message (query_messg, destination)
26.     set_timer (timeout)
27.     goto 3.
28. return messg

Figure 24: Augmented Reliable_receive Algorithm

Lines 5 through 8 in the new algorithm detect an overlapping change
initiated by an mserver that already completed the current change. A Wait message is sent
to the attempting mserver to postpone initiation of the new change until the old change is
completed. An example of this situation is shown in Figure 26. Lines 9 and 10 detect the
failure of the coordinator and call the election protocol, which will be described in the next
section. Lines 11 and 12 detect a virtually simultaneous change initiation of higher
priority. The mserver drops the current change and begins processing the new change.
An example of this situation is shown in Figure 27. Lines 13 through 15 detect the
reception of a Wait message in response to a Msg_Query message sent previously. The
mserver waits for a period, then restarts the Reliable_receive. An example of this action
is shown in Figure 26. Lines 16 through 22 detect a received Msg_Query message and
perform the necessary actions. If the Msg_Query is about the current change, it is from an
MI or child mserver waiting for a Direct message. The querying MI or mserver is sent a
Wait message to stall their reception of the expected message. If the Msg_Query is about
the previous change and the receiving mserver was the coordinator of the previous
change, then the querying mserver did not receive the Commit message. The receiving
mserver sends a Commit message to complete the last change. An example of this
situation is shown in Figure 26. The remaining lines of the function are the same as the
original, and perform the primary function of receiving the expected message within the
timeout period.
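
The case analysis performed inside the augmented Reliable_receive loop can be summarized
as a small dispatch function. The Python sketch below is a simplification under stated
assumptions: the Msg fields (about, outranks) are hypothetical tags standing in for the
view and priority comparisons the mserver would make, and the checks are treated as
mutually exclusive, which the listing in Figure 24 does not strictly require.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Msg:
    kind: str                # "Initiate", "Coord_Fail", "Wait", "Msg_Query", "Commit", ...
    about: str = "current"   # "previous", "current" or "next" change (assumed tag)
    outranks: bool = False   # higher priority, or concerns the current coordinator


def react(msg, expected, was_previous_coordinator=False):
    """Map one received message to the action taken in Figure 24."""
    if msg.kind == expected:
        return "return message"
    if msg.kind in ("Initiate", "Coord_Fail") and msg.about == "next":
        return "send Wait"                      # lines 5-8
    if msg.kind == "Coord_Fail" and msg.outranks:
        return "run election"                   # lines 9-10
    if msg.kind == "Initiate" and msg.outranks:
        return "switch to new change"           # lines 11-12
    if msg.kind == "Wait":
        return "pause, then restart receive"    # lines 13-15
    if msg.kind == "Msg_Query":
        if msg.about == "current":
            return "send Wait"                  # lines 17-19
        if msg.about == "previous" and was_previous_coordinator:
            return "resend Commit"              # lines 20-22
    return "ignore"
```

For example, react(Msg("Msg_Query", about="previous"), "Commit", was_previous_coordinator=True)
selects the resend of the Commit message, which is the situation m1 faces in Figure 26.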

The augmented Reliable_multi_receive function has the same name and is
called with the same parameters as the simpler version. The augmented
Reliable_multi_receive function is used in place of the simpler version in line 5 of the
coordinator basic change protocol shown in Figure 17. The modified or added lines to the
algorithm are underlined in Figure 25. The new version detects unexpected messages
received while collecting ACK messages, and responds to them appropriately. A detailed
description of the Reliable_multi_receive function follows.

Reliable_multi_receive (expected_messg, last_messg, responders)

/* Augmented Reliable_multi_receive, used after last_messg is multicast to reliably
collect all responses from other mservers or MIs. Handles relevant unexpected messages.
Modifies the set of responders to reflect those not responding */

1. set_timer (timeout)
2. retries = n
3. initialize (all responders = not_responded)
4. num_responders = number of responders
5. responses = 0
6. while ((timer not expired) and (responses < num_responders))
7.     receive_message (messg)
8.     if (messg == Initiate_message of higher priority than current change)
9.         goto Non-Coordinator Basic Change Protocol
10.    if (messg == Coord_Fail_message of higher priority than current change)
11.        goto Broadcast Election Protocol
12.    if (messg == Wait_message from previous change) /* group_view - 1 */
13.        wait (wait_timeout)
14.        goto Coordinator Basic Change Protocol /* restart change */
15.    if (messg == Msg_Query_message) /* other mserver querying status */
16.        if (current change)
17.            if (responder did not receive last_messg)
18.                send_message (last_messg, responder)
19.            else /* responder already received last_messg, so must wait */
20.                form_message (Wait, current_change, exclude_list)
21.                send_message (Wait_message, responder)
22.        if ((previous unfinished change) and (mygid == previous coordinator))
23.            form_message (Commit, current_change, exclude_list)
24.            send_message (Commit_message, responder)
25.    if ((messg == expected_messg) and (responder == not_responded))
26.        responder = responded
27.        responses = responses + 1
28. responders = {responders - (all responding responders)}
29. if ((responses < num_responders) and (retries > 0)) /* more responses to collect */
30.     retries = retries - 1
31.     set_timer (timeout)
32.     multicast (last_messg, responders) /* resend original message */
33.     goto 6


Figure 25: Augmented Reliable_multi_receive Algorithm

Lines 8 and 9 detect a simultaneous change of higher priority. The
coordinator stores, then drops the current change, ceases to be a coordinator, and begins
processing the new change. An example of this action is shown in Figure 27. Lines 10
and 11 detect the failure of the current coordinator or a new coordinator for a virtually
simultaneous change of higher priority than the current change, and call the election
protocol, which will be described in the next section. A received Coord_Fail message for
a change of lower priority is quietly discarded. Lines 12 and 13 detect the reception of a
Wait message sent by an mserver still processing the previous change. The coordinator
will perform a wait timeout, then resume the current change. An example of this is shown
at the top of Figure 26. Lines 15 through 21 detect a received Msg_Query message and
perform the necessary actions. If the Msg_Query is about the current change and the
querying mserver did not receive the Initiate or Direct message sent prior to the
Reliable_multi_receive call, the mserver resends the appropriate message. An example of
this situation is shown in Figure 26. If the querying mserver did receive the prior message,
then the Msg_Query is about a new change which has been started before the old change
completed. The coordinator sends a Wait message to stall the processing of the next
change until the last change is completed. An example of this is shown in Figure 23. If
the Msg_Query is about the previous change and the receiving mserver was the
coordinator of the previous change, then the querying mserver did not receive the Commit
message. The receiving mserver sends a Commit message to complete the last change.
An example of this situation is shown in Figure 26. The remaining lines of the function are
the same as the original, and perform the primary function of collecting the expected
response message within the timeout period.

[Event diagram. Annotations: m2 coordinator for C2; m1 and m2 finish C1, Commit to m4 is
lost; m4 and m3 waiting for Commit for C1, send Wait (C2) to m2; m3 finishes C1; m1 begins
wait timeout; m4 times out waiting for Commit from m1, sends Msg_Query (C1); new Commit
(C1); m4 finishes C1; m2 finishes wait timeout, resends Initiate (C2); all are now
processing C2.]

Figure 26: Resolution of Overlapping Changes

The event diagrams in Figures 26, 27, and 28 show various occurrences of
concurrent changes attempting to be processed in a core-set at one time. Since the MS
guarantees that only one change at a time will ever progress to completion, a method of
resolving the various concurrent change attempts must be used. The algorithms in
Reliable_receive and Reliable_multi_receive provide the necessary capability to resolve
concurrent changes to a single change. Figure 26 shows one case where an mserver m2
has completed the previous change C1 and immediately initiates the next change C2. This
is a very likely occurrence, since each mserver will queue failures and changes, waiting for
the next opportunity to initiate them. Mserver m4 did not receive the Commit message for
C1, so when it receives the Initiate message for C2 it sends a Wait message to the new
coordinator m2 to postpone the new change C2 until the old change C1 is completed. m4
times out waiting for the Commit message for C1 and sends a Msg_Query to m1, which is
now processing C2. m1 receives m4's Msg_Query while using the Reliable_receive
function and sends m4 another Commit message. m4 is now able to complete C1. After
m2 finishes the wait timeout, it resumes C2.

Figure 27: Resolution of Virtually Simultaneous Changes

Figure 27 shows two core-set mservers attempting to initiate virtually
simultaneous changes. All core-set mservers receive both Initiate messages, although
perhaps in different order due to the asynchronous environment. All core-set mservers
have the same core-set state information, and therefore make exactly the same decision
about which change has priority. In this case, C1 has priority, and is therefore recognized
as the change to be processed by all core-set mservers, while C2 is dropped by all. The
candidate coordinators must collect ACKs from all other core-set mservers, including the
other candidate coordinator, before sending the Commit message. The candidate
coordinators make the same decision about which change has priority; therefore, only one
coordinator will be selected and only one change will be processed. If all core-set
mservers receive all Initiate messages, it is impossible for more than one change to
progress to completion in the core-set. It will be shown later that under non-ideal
circumstances, some core-set mservers may have failed or may not receive all messages,
leading to a situation where more than one change is being processed in the same core-set.
However, it will also be shown that if this occurs, the core-set will always partition in such
a manner that all core-set mservers in the partition will be processing the same change, and
will arrive at the same consistent view.
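
The claim that every core-set mserver reaches the same decision regardless of arrival order
can be illustrated with a short Python sketch. The priority rule used here (the change whose
initiator has the lowest rank number wins, with the change identifier as a tie-breaker) is
purely a hypothetical stand-in; the actual priority scheme is defined elsewhere in the
thesis. The point is only that any deterministic rule over shared state is order-independent.

```python
from itertools import permutations


def winning_change(initiates, rank):
    """Pick the change every mserver should process.

    initiates: iterable of (change_id, initiator) pairs, in arrival order.
    rank: shared mapping initiator -> rank number. Lowest number wins here,
    which is an illustrative rule, not the thesis's actual priority scheme.
    """
    return min(initiates, key=lambda c: (rank[c[1]], c[0]))[0]


# Shared core-set state, identical at every mserver (assumed values).
rank = {"m1": 1, "m2": 2, "m3": 3, "m4": 4}
arrivals = [("C1", "m1"), ("C2", "m3")]

# Each permutation models one mserver's arrival order for the Initiate messages.
choices = {winning_change(order, rank) for order in permutations(arrivals)}
```

Because the decision depends only on the set of Initiate messages received and on shared
rank information, choices contains a single change. The guarantee weakens exactly when some
mserver does not see all Initiate messages, which is the non-ideal case discussed above.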
Figure 28 is a combination of the overlapping change shown in Figure 26
and the virtually simultaneous changes shown in Figure 27. This event diagram shows
that even under the circumstances when changes overlap at the beginning and end of a
change, only one change at a time progresses to completion.

Figure 28: Virtually Simultaneous and Overlapping Changes
5. Coordinator failure


The basic change-processing protocol assumes that the coordinator of the change
will continue to function throughout the processing of the change. The protocol
definitions and examples to this point handle various situations, including the failure of
non-coordinator core-set mservers. However, it is entirely possible that the coordinator of
a change may fail or be unable to communicate with others during the processing of a
change. In the event of coordinator failure, a new coordinator must be elected.
Birman and Ricciardi [9] have proven that when the coordinator of a change can fail, a
three-phase change protocol is required. To this end, another phase must be added to the
two-phase basic change protocol. This phase is a broadcast election phase, as described in
[26], which is conducted to elect a new coordinator to resume the original change being
processed. After the distributed election is accomplished, the new coordinator will restart
the original change with a new Initiate message to all surviving core-set mservers, or,
under special circumstances, will simply send a Commit message to complete the change.
An illustration of the three-phase election and change-processing protocol is shown in
Figure 29, and the algorithm for the broadcast election is listed in Figure 30.
Figure 29: Election and Change-processing Protocol

/* Coordinator failure has been detected. Elect new coordinator */
1. update (exclude_list, internal state)
2. form_message (Coord_Fail, current_change, exclude_list)
3. multicast (Coord_Fail_message, {core_set - exclude_list - coordinator})
4. Reliable_multi_receive (Coord_Fail_message, Coord_Fail_message, {core_set -
   exclude_list - coordinator})
5. update (exclude_list, internal state)
/* determine new coordinator from responding mservers */
6. coordinator = highest rank mserver with current change active
7. Resume_change (current_change, coordinator)


Figure 30: Broadcast Election Protocol

The broadcast election protocol is commenced upon detection by one or more
mservers of the failure of the current coordinator. This detection could occur by
monitoring or by the timeout of an expected message while using the Reliable_receive
function. The detecting mserver will multicast a Coord_Fail message to all other core-set
mservers, which will include the status of the mserver in processing the current change.
This mserver will then collect responses from all other mservers with the
Reliable_multi_receive function. The other mservers receiving the Coord_Fail message
will also multicast their status and collect responses from all others. In this way, all
core-set mservers learn of the status of all other core-set mservers with respect to the
interrupted change. These steps are covered by lines 1 through 5 in Figure 30. In order
for an mserver to become the new coordinator, it must have received the original Initiate
message, but not yet have committed the change. Only an mserver still processing the
change will have sufficient information to restart the change. There is guaranteed to be at
least one such mserver, since only an mserver still processing the change could determine
that the coordinator had failed. Since all mservers have learned the status of all other
mservers, a uniform distributed decision can be made by all as to the identity of the new
coordinator. To select the new coordinator from those mservers still processing the
current change, a priority scheme is used. The mserver with the highest rank in the
core-set, still processing the original change, will become the new coordinator. All
core-set mservers know the rank of all other core-set mservers, so they all make exactly
the same distributed decision. Thus, a single new coordinator is chosen.
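
The selection rule in line 6 of Figure 30 reduces to a deterministic function of the status
information gathered during the Coord_Fail exchange. The Python sketch below assumes a
simple representation: a status string per mserver and a numeric rank in which a smaller
number means a higher rank. Both encodings are hypothetical choices for the example.

```python
def elect_coordinator(statuses, rank, excluded=frozenset()):
    """Return the new coordinator, or None if no mserver qualifies.

    statuses: mserver -> "processing" | "committed" | "unknown", as learned
              from the Coord_Fail exchange (assumed encoding).
    rank:     mserver -> rank number; smaller is treated as higher rank.
    excluded: mservers already on the exclude_list.
    """
    candidates = [m for m, s in statuses.items()
                  if s == "processing" and m not in excluded]
    return min(candidates, key=rank.__getitem__) if candidates else None


rank = {"m1": 1, "m2": 2, "m3": 3, "m4": 4}

# All survivors are still processing the change: the highest rank wins.
all_processing = elect_coordinator(
    {"m2": "processing", "m3": "processing", "m4": "processing"}, rank)

# One survivor already committed and one is excluded: only m4 qualifies.
one_left = elect_coordinator(
    {"m2": "unknown", "m3": "committed", "m4": "processing"},
    rank, excluded={"m2"})
```

Since every surviving mserver evaluates the same function on the same status and rank
information, each arrives at the same coordinator without any further message exchange.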

Once the new coordinator is chosen, it will resume the original change using the
two-phase basic change-processing protocol, as shown in Figure 29. However, if at least
one core-set mserver has committed the change, then it is safe for the coordinator to
immediately multicast a Commit message to have all core-set mservers commit the change.
This is possible due to the fact that in order for any mserver to have received a Commit
message from the failed coordinator, that coordinator must have received ACKs from all
surviving core-set mservers. This means that all mservers in the core-set have knowledge
of the change, and can therefore commit the change. Any mserver that did not have
knowledge of the change would have been detected failed by the old coordinator, using
the Reliable_multi_receive procedure. The old coordinator would include all detected
failures in the exclude_list added to each multicast message, and thus any mserver
receiving the Commit message would learn of the detected failure of all mservers which
had not received the original Initiate message. The mserver, learning by gossip of the
failure of other mservers, would cease to communicate with them. These excluded
mservers will be removed from the core-set, so that only mservers which had received the
original change remain.

Figure 31 shows the event diagram for the compressed election and
change-processing protocol described above. Figure 32 is the listing for the
Resume_change function used in line 7 of the broadcast election protocol. This function
makes the decision for the new coordinator whether to restart the original change with an
Initiate message or use the compressed protocol and simply multicast a Commit message
to complete the original change.
[Event diagram. Phase labels: Election, Commit.]

Figure 31: Compressed Election and Change-processing Protocol


Resume_change

/* Following broadcast election of new coordinator. */
1. if (any core_set mserver has committed the change)
   /* then use compressed change protocol - send/receive Commit only */
2.     if (coordinator)
3.         form_message (Commit, current_change, exclude_list)
4.         multicast (Commit_message, {core_set - exclude_list - coordinator})
5.     else /* non-coordinator */
6.         messg = Reliable_receive (Commit_message, Msg_Query_message)
7. else /* no mservers have committed the change - must restart */
8.     Reinitiate (current_change)

Figure 32: Resume Change Algorithm
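
The branch structure of Figure 32 can be sketched in Python as a pure decision function.
The status encoding and the returned action strings are assumptions made for the example.
In particular, Reinitiate is rendered here as the coordinator sending a new Initiate message
and the others awaiting it, following the description of the restart given in the text.

```python
def resume_change(statuses, me, coordinator):
    """Decide the next step for mserver `me` after the election.

    statuses: mserver -> "processing" | "committed" (assumed encoding of the
    status learned during the Coord_Fail exchange).
    """
    if "committed" in statuses.values():
        # Compressed protocol: a Commit alone finishes the change.
        return "multicast Commit" if me == coordinator else "await Commit"
    # No mserver committed: the change is restarted (Reinitiate in Figure 32).
    return "send Initiate" if me == coordinator else "await Initiate"


# One mserver already committed, so the new coordinator m2 only sends Commit.
after_commit = {"m2": "processing", "m3": "processing", "m4": "committed"}
step_m2 = resume_change(after_commit, "m2", "m2")
step_m3 = resume_change(after_commit, "m3", "m2")

# Nobody committed, so the new coordinator restarts the change.
before_commit = {"m2": "processing", "m3": "processing", "m4": "processing"}
restart_m2 = resume_change(before_commit, "m2", "m2")
```

The compressed branch is safe for the reason given earlier: a commit by any mserver implies
that the failed coordinator had already collected ACKs from all surviving core-set mservers.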

Examples of various scenarios involving the failure of the current coordinator are
shown in five event diagrams on the following pages. These examples illustrate some of
the more likely scenarios which might be encountered when a coordinator is detected
failed, and the sequence of events leading to the election of a new coordinator and
completion of the original change.

Figure 33 shows the sequence of events when the coordinator fails in the Initiate
phase, immediately after multicasting the Initiate message. All other core-set mservers
time out waiting for the Commit message, detect the coordinator failed, and conduct an
election for a new coordinator. The new coordinator completes the original change.
[Event diagram. Timelines: m1, m2, m3, m4. Annotations: m1 coordinator for C1; coordinator
fails; m2, m3, and m4 begin message timeout; all time out, query coordinator; all time out,
query coordinator again; m4 times out first, sends Coord_Fail; m2 and m3 send Coord_Fail;
coordinator detected failed; Election Phase; m2 has highest rank, becomes coordinator, and
re-initiates C1.]

Figure 33: Coordinator Failure During Initiate Phase

Figure 34 illustrates the case where the monitor of the coordinator is unable to receive
from the coordinator. The monitor m2 detects the coordinator m1 failed by monitoring
and initiates a change C2 for the failure of m1 by multicasting an Initiate message to all
core-set mservers. The other core-set mservers are already processing the change C1.
The change C2 is recognized by m3 and m4 as a failure of the current coordinator;
however, it is also a virtually simultaneous change, since no mserver has committed C1.
For this reason, C2 is treated as a virtually simultaneous change of higher priority than C1,
avoiding the need to elect a new coordinator. Since the failed coordinator initiated the
original change and then failed, there is no need to resume processing of this change. If an
application group submitted the change to m1, the group will be partitioned at m1 anyway,
so there is no need to process the submitted change on the other side of the partition. If
m1's change was a physical change about a core-set mserver, it will either be redetected
and processed, or perhaps will remedy itself.

[Event diagram. Timelines: m1, m2, m3, m4. Annotations: m3 and m4 receive Initiate (C1),
m2 does not (GV=n); m2 detects m1 failed by monitoring, sends Initiate (C2) for m1's
failure (GV=n); m3 and m4 realize C2's subject is the coordinator, drop C1 and begin C2;
m2, m3, and m4 commit C2, m1's failure; m3 and m4 ignore message from m1.]

Figure 34: Coordinator Failure With Lost Initiate Message

Figure  35  illustrates  the  case  where  the  coordinator  is  detected  failed  in  the 
Commit  phase,  after  one  or  more  mservers  have  received  the  Commit  message.  The 
core-set  mservers  conduct  a  broadcast  election  in  which  m2  becomes  the  new  coordinator. 
Since  m4  committed  the  original  change,  the  compressed  change  protocol  is  used,  and  m2 
multicasts  a  Commit  message  to  finish  the  change. 


[Event diagram. Timelines: m1, m2, m3, m4. Annotations: m2 detects m1 failure by
monitoring, sends Coord_Fail; m4 commits C1; m3 and m4 learn of m1's failure, ignore
messages from m1; m2 has highest rank, becomes new coordinator, finishes C1 with Commit.]

Figure 35: Coordinator Failure In Commit Phase


Figure 36 illustrates an unusual case where the failure of the coordinator and lost
messages lead to one core-set mserver committing the change C1 (m3), one mserver still
processing the change (m4), and one mserver never having received the change (m2). As
a result of this situation, m4 will become the new coordinator, since it is the only mserver
still processing C1. Mserver m3 learned of m2's detected failure with the Commit message
received from the original coordinator m1. The end result is that m3 and m4 commit C1
and reform into a new core-set, while m2 never learns of C1, and is excluded from the
original core-set.

The final event diagram shown in Figure 37 shows a situation in which the
coordinator has failed after multicasting an Initiate message which was received by only
one core-set mserver. Another core-set mserver, m4, also initiated a virtually
simultaneous change of lower priority. m1 receives the Initiate message for both changes

[Event diagram for mservers m1, m2, m3, and m4, with a partition between m1 and m2.
Annotations, grouped by mserver:
m1: coordinator for C1, begins ACK timeout; times out on m2's ACK and resends Initiate
(twice); sends Wait (C1) to m3 and m4; detects m2 failed and sends Commit (C1) with gossip
about m2's failure; the previous coordinator then fails.
m2: sends Coord_Fail indicating it had never received C1. None are communicating with m2;
m2 times out on Commit, detects m4 failed, and in processing m4's failure detects m3
failed; m2 reforms as a singleton set.
m3 and m4: begin message timeout; time out and query the coordinator; begin a timed wait
for m1's Commit.
m3: commits C1 and learns of m2's failure; the message to m4 is lost, so m4 is still waiting;
after the coordinator is detected failed, m3 sends Coord_Fail to m4 only, since it detected
m2 failed, and m4 learns of m2's failure.
m4: times out on the wait for Commit from m1, sends Msg_Query, retries, then detects m1
failed; m4 is the highest rank mserver still processing C1 and sends Commit (C1), since m3
has already finished C1.]

Figure 36: Coordinator Failure With Lost Messages

[Event diagram for mservers m1, m2, m3, and m4. Annotations: m2 is coordinator for C1; the
coordinator fails, and only m1 gets Initiate (C1). m1 begins message timeout; m1 times out
and sends a Query to the coordinator; m1 times out and sends another Query; m1 times out,
detects the coordinator failed, and sends a Coord_Fail. m4 is coordinator for C2 and begins
ACK timeout. m3 sees only C2; m1 sees C1 and C2, and C1 has priority. m3 detects m1's
failure by monitoring and queues the failure. m4 times out on ACKs and resends Initiate (C2)
to m1 and m2. m3 and m4 learn of C1, drop C2, note m2's failure, and begin an election
for C1.]

Figure 37: Coordinator Failure With Simultaneous Change

and decides that C1 has priority, and therefore drops C2, assuming that all other core-set
mservers will make the same decision. However, m3 and m4 did not receive C1, so they
continue to process C2. m1 eventually times out on the Commit message expected from
m2, detects the coordinator failed, and multicasts a Coord_Fail message to all core-set
mservers. m3 and m4 now learn of the higher priority change C1. Since no mservers had
committed C2, the change is dropped and m3 and m4 begin processing C1 with an
election for a new coordinator.

D. APPLICATION GROUP CHANGE PROCESSING

The  general  description  of  application  group  change  processing  has  already  been 
presented  in  previous  sections.  In  this  section,  the  protocols  necessary  to  submit  a  change 
to  the  application  core-set  and  then  reliably  propagate  the  core-set  change  directive  back 
to  the  application  are  presented.  These  protocols  are  divided  into  the  algorithms  used  by 
MIs submitting or receiving a change directive, and those used by mservers in the
hierarchy  or  in  the  core-set  of  the  application.  Figure  38  shows  the  basic  application 
change  protocol. 


[Event diagram showing the Submit, Initiate, Commit, and Direct phases over time for the
2nd-level core-set mservers (including the coordinator), the 1st-level mserver, and the
submitting MI.]

Figure 38: Application Group Change Protocol


1. MIs

The MI accepts change requests from the application groups that it supports and
relays these requests to the LAN mserver for submission to the application core-set. The
MI may also detect failed application process members and submit these changes as well.
The algorithms used by an MI are listed in Figures 39 and 40, for the submitting MI and a
non-submitting MI, respectively.

Submitting MI Basic Application Change

/* MI received change request or detected change in an application group */
/* Submit phase */

1.  current_change = application change data (received or detected)
2.  update (internal state)
3.  form_message (Submit, current_change)
4.  send_message (Submit_message, parent_mserver)
5.  messg = Reliable_receive (Direct_message, Msg_Query_message)

/* Direct phase */

6.  form_message (ACK, current_change)
7.  send_message (ACK_message, parent_mserver)
8.  update (internal state)
9.  application_group_view = application_group_view + 1
10. reliably inform application of change


Figure 39: Submitting MI Application Group Change Protocol
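
As a rough illustration of the listing above, the following Python sketch models an MI as an object that queues outgoing messages rather than sending them over a network. The class name `MembershipInterface`, the `outbox` and `delivered` lists, and the tuple message format are assumptions made for this sketch and are not part of the MS protocol; timeouts, Msg_Query retries, and the internal state updates are omitted.

```python
class MembershipInterface:
    """Toy model of an MI (hypothetical names; no real networking)."""

    def __init__(self, parent_mserver):
        self.parent_mserver = parent_mserver
        self.application_group_view = 0
        self.current_change = None
        self.outbox = []      # (message type, change, destination) tuples
        self.delivered = []   # changes reported to the local application

    def submit(self, change):
        # Submit phase: record the change and relay it to the parent mserver.
        self.current_change = change
        self.outbox.append(("Submit", change, self.parent_mserver))

    def on_direct(self, change):
        # Direct phase: ACK the directive, install the next view,
        # then inform the application of the completed change.
        self.current_change = change
        self.outbox.append(("ACK", change, self.parent_mserver))
        self.application_group_view += 1
        self.delivered.append(change)
```

In this sketch a submitting MI calls `submit` and later `on_direct`, while a non-submitting MI runs only the `on_direct` steps.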


Non-submitting MI Basic Application Change

/* MI received change Direct message from parent mserver */
/* Direct phase */

1.  form_message (ACK, current_change)
2.  send_message (ACK_message, parent_mserver)
3.  update (internal state)
4.  application_group_view = application_group_view + 1
5.  reliably inform application of change


Figure 40: Non-submitting MI Application Group Change Protocol

2. Mservers

The LAN mserver accepts application change requests and failures submitted by
the MIs running on the host computers of the LAN. These changes are then submitted up
the MS hierarchy of mservers to the application core-set, where the change is processed.
Once the core-set commits the change, all core-set mservers with application members
below them multicast the change directive to their children mservers with application
members below them. At each level an ACK is sent to the parent mserver to ensure
reliable delivery of the change directive to all application member processes. The change
directive is propagated to each MI with members of this application, which then informs
the application members of the completed change. The algorithms used by mservers are
listed in Figures 41 and 42, for the non-core-set mserver and application core-set mserver,
respectively.


Non-core-set Mserver Basic Application Change

/* mserver received message relayed from submitting MI; will reliably relay to
   parent_mserver */
/* Submit phase */

1.  send_message (Submit_message, parent_mserver)
2.  messg = Reliable_receive (Direct_message, Msg_Query_message)
3.  if (messg != Direct_message)   /* failed parent_mserver */
4.      goto Broadcast Election Protocol
5.  update (internal state)

/* Direct phase */

6.  form_message (ACK, current_change)
7.  send_message (ACK_message, parent_mserver)
8.  form_message (Direct, current_change, exclude_list)
9.  multicast (Direct_message, {children with application members - excluded})
10. Reliable_multi_receive (ACK_message, Direct_message, {children with application
    members - excluded})
11. update (internal state)

Figure 41: Non-core-set Mserver Application Group Change Protocol

Core-set Mserver Basic Application Change

/* core-set mserver learned of application change by Submit message relayed from
   submitting MI or Initiate message from application change coordinator */

1.  execute Basic Change Protocol
2.  form_message (Direct, current_change, exclude_list)
3.  multicast (Direct_message, {children with application members - excluded})
4.  Reliable_multi_receive (ACK_message, Direct_message, {children with application
    members - excluded})
5.  update (exclude_list, internal state)

Figure 42: Core-set Mserver Application Group Change Protocol
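
The downward propagation of the Direct message in the two listings above can be sketched in Python as a walk over a tree of mservers and MIs. The `Node` class, the `hosts_members` flag, and the representation of ACKs as (child, parent) pairs are assumptions made for illustration; reliable delivery, timeouts, and retries are not modeled.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass
class Node:
    name: str
    hosts_members: bool = False   # True for an MI with local application members
    children: List["Node"] = field(default_factory=list)


def has_members_below(node: Node) -> bool:
    """True if the node itself or any descendant hosts application members."""
    return node.hosts_members or any(has_members_below(c) for c in node.children)


def propagate_direct(node: Node,
                     excluded: FrozenSet[str] = frozenset()) -> List[Tuple[str, str]]:
    """Relay a Direct message down from `node`; return the (child, parent) ACKs."""
    acks = []
    for child in node.children:
        # Skip excluded children and subtrees with no application members.
        if child.name in excluded or not has_members_below(child):
            continue
        acks.append((child.name, node.name))   # the child ACKs its parent on receipt
        acks.extend(propagate_direct(child, excluded))
    return acks
```

Each receiving node acknowledges to its own parent before relaying further, which mirrors the level-by-level ACKs described in the text.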

Figure  43  is  an  event  diagram  showing  the  actions  when  a  Submit  message  is  lost. 
In  this  case,  the  Submit  message  is  lost  between  the  LAN  mserver  and  the  core-set 
mserver. The MI times out waiting for the Direct message from the LAN mserver and
sends a Msg_Query. The LAN mserver resends the Submit message to the core-set and
also sends a Wait message to the querying MI, indicating that the mserver is still pursuing
the application change and the MI should wait for a while longer before detecting a
failure. The core-set now receives the Submit message, completes the processing of the
application change, and propagates a Direct message to the LAN mservers and then to the
MIs. The LAN mservers send an ACK to the core-set, and the MIs send an ACK to the
LAN mserver, indicating successful propagation of the Direct message.

Figure  44  is  very  similar  to  Figure  43  except  the  Direct  message  is  lost  instead  of 
the  Submit  message.  The  MI  times  out  waiting  for  the  Direct  message  from  the  LAN 
mserver and sends a Msg_Query. Instead of sending a Wait message to the querying MI,
the LAN mserver sends a Msg_Query to the core-set. The mserver in the core-set
receiving the query resends the lost Direct message, which is propagated to the MI with
ACKs returned at every level, and then to the application.

Figure 45 shows the failure of a core-set mserver after processing the application
change, but before multicasting the change directive to the children mservers. Using
message timeouts and retries, the LAN mserver detects that the parent mserver in the
core-set has failed. The MS hierarchy is partitioned at the failed core-set mserver, causing
a partition in the application group as well.

Figure 43: Application Group Change With Lost Submit Message

Figure 44: Application Group Change With Lost Direct Message

Figure 45: Application Group Change With Failed Coordinator


E.  PARTITION  RESOLUTION 


The  final  protocol  provides  the  means  for  the  MS  hierarchy  and  application  groups  to 
dynamically  reconfigure  in  the  event  of  network  partitions.  The  reconfiguration  method  of 
the  physical  hierarchy  is  fixed,  whereas  the  reconfiguration  method  used  by  each 
application  group  is  determined  by  QoS  selections  made  by  each  MSU  when  the 
application group is initially created.

1. Dynamic Reconfiguration of Physical Core-set

a. Perceived Failures and Partitions

As discussed previously, the actual failure of one or more mservers in a
core-set is indistinguishable from the perceived failure of these mservers caused by an
interruption in the network communication capability. For this reason, all perceived
failures are treated as actual failures. The failed mservers are excluded from further
core-set communications, and the core-set is reformed without the failed mservers. One
partition of the core-set will contain the original parent mserver; the other partitions will
not. This means that the physical hierarchy of mservers is also partitioned. However,
there exists a possibility that the mservers perceived as failed are still functioning. It
would be desirable to have these mservers automatically rejoin the original core-set when
the network partition is repaired. This protocol provides the means for this automatic
reformation of the physical hierarchy.

b.  Automatic  Reformation  Using  the  Shared  Multicast  Group 

The monitoring protocol, basic change protocol, and broadcast election
protocol provide the means to detect failures or perceived failures of mservers. The basic
change protocol provides the means to process the failure of core-set mservers and reform
the core-set. An example of such a reformation due to a network partition is displayed in
Figure 46, with the reformed subsets of mservers shown in Figure 47. Although the
perceived failed mservers have been removed from the core-set membership, they have not
been removed from the membership of the multicast group which the core-set uses to
multicast change-related messages. This provides the means by which an automatic
reformation of the original core-set may be accomplished.

[Figure annotations: Simultaneous failures are detected by m1 and m9. Two partitions also
exist in the set as shown. Coordinator m1 is unable to send to m7, m9, and a third mserver
whose label is illegible in the source. Coordinator m9 is unable to send to m1, m2, and m3.
After processing C1 and C2, the set has reformed into two subgroups. Subgroup 1 is larger
because C1 had priority.]

Figure 47: Partitioning of a Core-set

Once the core-set is reformed without the perceived failed mservers,
attempts are periodically made to reestablish communications with the other partition of
mservers. These attempts are made by multicasting query messages to the original
core-set multicast address. Current members of the core-set ignore these queries;
however, an mserver in the other partition will respond to a query, if able. If
communication is reestablished within a predetermined timeout period, a simple merge of
the two partitions is conducted, restoring the original core-set, with the exception of any
new additions or deletions to either partition. The group view of the reformed core-set is
set to one more than the higher view number of the two formerly partitioned subsets. This
is the same action that would be performed with an ordinary merge of two separate
core-sets of mservers. The original parent mserver is now a member of both of the reformed
subsets, so the physical hierarchy is also automatically restored.
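
The view-number rule for a merge can be written down directly. The sketch below is a minimal illustration in Python; the function name `merge_partitions` is hypothetical, and the ordering of the merged membership (first partition's members followed by the second's) is an assumption, since this section does not specify how ranks are assigned after a merge.

```python
def merge_partitions(view_a, members_a, view_b, members_b):
    """Return (view number, membership) of the core-set formed by merging two partitions."""
    # The merged view number is one more than the higher of the two view numbers.
    merged_view = max(view_a, view_b) + 1
    # Assumed ordering: members of the first partition, then any not already listed.
    merged_members = list(members_a) + [m for m in members_b if m not in members_a]
    return merged_view, merged_members
```

Taking one more than the higher view number keeps the merged view number larger than any view number previously installed in either partition.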

c.  Unique  Names  and  Addresses  of  Partitioned  Core-sets 

In the event that the partitions of mservers are unable to restore
communications, the reformed subsets are converted to completely independent core-sets.
Since all core-sets of mservers must have a unique name and multicast address, some
method must be used to automatically obtain these unique values. To obtain a unique
name, each sub-core-set appends a unique suffix to the original core-set group name. This
suffix value must be automatically derived by each partitioned subset of mservers
independently, and with a guaranteed unique value for all partitioned subsets. The most
readily available attribute that all subsets can use to obtain a guaranteed unique name is
the original group identity (gid) of a significant mserver remaining in each partition. The
lowest mserver gid of the mservers remaining in each partition is appended to the original
core-set group name. In this manner, all partitioned core-sets are guaranteed a unique
core-set name. However, all partitioned core-sets are still easily identifiable as subsets of
the original core-set, which simplifies the task of manually reconfiguring the physical
hierarchy when the network is repaired.
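
A minimal Python sketch of this naming rule follows. The function name `partition_name`, the use of integer gids, and the "." separator are assumptions made for illustration; the text specifies only that the lowest remaining gid is appended to the original name.

```python
def partition_name(original_name, remaining_gids, separator="."):
    """Derive a core-set name for a partition from its lowest remaining mserver gid."""
    if not remaining_gids:
        raise ValueError("a partition must contain at least one mserver")
    return f"{original_name}{separator}{min(remaining_gids)}"
```

Because the partitions of a core-set are disjoint, no two partitions share a lowest gid, so each partition can compute its suffix without communicating with the others.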

2.  Dynamic  Reconfiguration  of  Application  Groups 
a.  Reconfiguration  Rules 

Any  partition  of  the  MS  physical  hierarchy  results  in  a  partition  of  all 
application  groups  which  spanned  the  original  hierarchy.  The  method  of  resolving  the 
partitions  of  each  of  the  application  groups  depends  on  the  QoS  selection  made  by  the 
MSU  at  the  time  the  application  group  was  created.  The  MSU  uses  the  size,  membership 
and  name  characteristics  associated  with  an  application  group  as  the  parameters  to  specify 
how  partitions  will  be  resolved.  These  parameters  are  used  by  two  rules  which  explicitly 
determine how partitioned subgroups will be handled. These rules are:

1. Keep alive any partitioned subgroups that meet a certain condition specified
by the user. Any subgroups which do not meet the condition will be
terminated.

2. Partitioned subgroups will attempt to find and merge with other partitioned
subgroups that have a certain user-specified property.

The first rule utilizes a user-specified condition related to group size and/or
membership to determine which subgroups will continue to function. Using group size as
the condition for deciding which subgroups survive, the MSU may specify that all
subgroups terminate upon a partition by selecting a size equal to the original group size.
All partitioned subgroups would be smaller than the original group, and would therefore
terminate, also terminating the application. Similarly, the MSU may specify that all
subgroups survive a partition by selecting a limiting size equal to zero. All partitioned
subgroups would be larger than the selected size and would continue to function. Any
size between zero and the original group size may be selected, permitting subgroups larger
(or smaller) than the specified size to continue to function, and terminating all subgroups
smaller (or larger). The membership of the group may also be used to determine which
subgroups survive. The MSU may specify the condition that a subgroup must have a
particular member or type of member to continue to function. Any subgroups not
containing such a member will terminate.

The second rule utilizes an MSU-specified property related to the original
group name or the identity and location of significant members of the group to determine
which partitioned subgroups will attempt to merge. The simplest case is that all
subgroups attempt to merge with all other subgroups of the original group. The property
used is the same base group name common to all subgroups from the original group.
Another simple case is that no subgroups attempt to merge. Use of the null property
ensures that no subgroups attempt to merge with other subgroups. The identity of certain
key members of the original group may also be used as the property. Partitioned
subgroups attempt to merge with the subgroup containing these key members.
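
The two rules can be sketched as a pair of predicates. In the Python illustration below, the `Subgroup` class, the member-type labels, and the parameter names `larger_than`, `required_type`, `merge`, and `key_type` are all hypothetical; the sketch covers only the "larger than" form of the size condition and represents the null property as `merge=False`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Subgroup:
    base_name: str                 # name shared by all subgroups of one original group
    member_types: Tuple[str, ...]  # one type label per member of this subgroup


def survives(sg: Subgroup, larger_than: int = 0,
             required_type: Optional[str] = None) -> bool:
    """Rule 1: the subgroup stays alive only if it meets the size and membership condition."""
    big_enough = len(sg.member_types) > larger_than
    has_required = required_type is None or required_type in sg.member_types
    return big_enough and has_required


def merge_targets(sg: Subgroup, others: List[Subgroup], merge: bool = True,
                  key_type: Optional[str] = None) -> List[Subgroup]:
    """Rule 2: list the other subgroups that `sg` would attempt to merge with."""
    if not merge:   # the null property: never attempt to merge
        return []
    return [o for o in others
            if o.base_name == sg.base_name
            and (key_type is None or key_type in o.member_types)]
```

With a limiting size of zero every non-empty subgroup survives, and with a limiting size equal to the original group size none does, which matches the two extreme cases described for the first rule.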

By  combining  these  two  rules,  a  wide  variety  of  partition  resolution  methods 
can  be  produced.  The  first  rule  determines  which  partitions  survive,  and  the  second  rule 
determines  which  partitions  attempt  to  merge.  Each  rule  can  also  combine  multiple 
parameters to provide very specific and flexible methods of handling partitions. For
example, all subgroups larger than a size of three which contain a particular member type
will survive and attempt to merge with subgroups with the same base group name
containing another particular member type.

V. CORRECTNESS ARGUMENTS


In  the  previous  chapters  the  architecture  and  protocol  descriptions  for  a  global, 
decentralized  membership  service  were  presented.  In  this  chapter  arguments  and  proofs 
are  presented  to  show  that  the  MS  protocol  performs  correctly  under  all  circumstances. 
The  correct  performance  of  the  MS  protocol  leads  to  achievement  of  the  desired  attributes 
of  the  MS,  as  discussed  in  Chapter  II.  The  arguments  presented  here  focus  on  the 
functioning  of  a  single  core-set  of  mservers,  treating  the  set  as  a  group  in  itself,  with  the 
individual  mservers  in  the  core-set  as  members  of  the  group.  The  proofs  show  that 
changes  to  the  membership  of  this  group  are  made  in  a  manner  which  always  maintains 
strong consistency of the membership information at all members of the core-set. The
arguments  about  the  correct  operation  of  a  single  core-set  of  mservers  can  then  be 
extended  to  the  physical  hierarchy  of  core-sets  of  mservers,  and  then  to  the  application 
groups  which  utilize  the  MS,  showing  that  consistent  membership  information  is  always 
obtained  at  all  application  process  group  members. 

The  assumptions  and  definition  of  terms  used  in  the  proofs  are  listed  first,  with  their 
specific  implications  with  respect  to  the  correctness  arguments  described.  These  are 
followed  by  a  description  of  the  criteria  for  correctness  and  a  summary  of  the  actions  that 
the  protocol  takes  to  maintain  the  membership  knowledge  accordingly.  Finally,  key 
statements  about  different  aspects  of  the  protocol  are  proven,  thus  proving  the  correctness 
of  the  MS  protocol. 

A.  ASSUMPTIONS 

As  described  previously,  an  asynchronous  communication  environment  is  assumed  to 
exist,  providing  an  unreliable  message  delivery  capability  with  an  unbounded  delay,  as  in 
the present best-effort Internet. Thus, network failures that include dropped messages and
network partitions are permitted. All member failures are assumed to be crash or fail-stop
[5, 9, 10, 11]. In such conditions, failures can only be perceived, and both actual member
failures and network partitions lead to perceived failures of the members. For this reason,
every  perceived  failure  is  processed  as  an  event  that  partitions  the  group.  Partitions  of  the 
membership  of  a  group  are  assumed  to  be  acceptable  to  the  user  of  the  membership 
service,  who  may  make  QoS  selections  to  determine  how  partitioned  groups  will  continue 
to  function,  as  described  in  earlier  chapters.  Unlike  many  other  membership  protocols, 
majority-based  decisions  are  not  used  by  the  MS  protocol  to  ensure  that  only  a  single 
partition  survives;  instead,  complete  agreement  is  required  among  all  surviving  members, 
leading  to  the  possibility  of  separate,  functioning  partitions  of  any  size.  Continuous 
changes  to  the  membership  are  allowed;  however,  the  changes  are  committed  one  at  a 
time,  and  with  a  specific  order  in  each  partition. 

B.  TERMS  AND  DEFINITIONS 

The  specific  terms  and  implications  of  their  use  in  the  correctness  arguments 
described  later  are  listed  below. 

1.  Change  Events 

The events that cause a change in the membership are: explicit join and leave
requests  by  members,  perception  of  failure  by  the  monitoring  of  members  by  other 
members,  and  suspicion  of  failures  resulting  from  member  or  network  failures  which  lead 
to  a  lack  of  response  during  change  processing. 

2.  Change  Event  Priority 

Every  change  event  has  an  associated  priority  to  enable  ordering  of  virtually 
simultaneous  changes.  Failures  have  a  higher  priority  than  voluntary  joins  or  departures. 
Priority  of  a  failure  or  departure  event  is  the  rank,  or  seniority,  of  the  failed  member  in  the 
group.  The  most  senior  member  always  has  a  rank  of  0.  When  two  or  more  members 
initiate a change simultaneously, the coordinator initiating the higher priority change, as
determined by the rank of the subject of the change, prevails. In virtually simultaneous
joins,  the  subjects  do  not  yet  have  a  group  rank,  so  the  network  address  of  each  subject  is 
used  in  place  of  the  rank.  The  subject  with  the  lower  network  address  will  be  interpreted 
as  having  a  higher  temporary  rank,  and  therefore  will  have  a  higher  priority,  joining  the 
group  first. 
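
The ordering of virtually simultaneous changes can be illustrated with a sort key. The Python sketch below is an assumption-laden illustration: the `ChangeEvent` class and function names are hypothetical, network addresses are taken to be IP addresses compared numerically, and the relative order of a voluntary leave against a join (leaves first) is a choice made only for this sketch, since the text does not state it.

```python
from dataclasses import dataclass
from ipaddress import ip_address
from typing import Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    kind: str                              # "failure", "leave", or "join"
    subject_rank: Optional[int] = None     # rank of the subject; 0 is most senior
    subject_address: Optional[str] = None  # network address, used only for joins


def priority_key(ev: ChangeEvent):
    """Smaller keys mean higher priority."""
    if ev.kind == "failure":
        return (0, 0, ev.subject_rank)     # failures outrank voluntary changes
    if ev.kind == "leave":
        return (1, 0, ev.subject_rank)     # assumed: leaves are ordered before joins
    if ev.kind == "join":
        # A lower network address counts as a higher temporary rank.
        return (1, 1, int(ip_address(ev.subject_address)))
    raise ValueError(f"unknown change kind: {ev.kind}")


def prevailing_change(events: Iterable[ChangeEvent]) -> ChangeEvent:
    """Return the change that prevails among virtually simultaneous changes."""
    return min(events, key=priority_key)
```

Comparing addresses numerically rather than as strings avoids ordering "10.0.0.12" before "10.0.0.9".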

3.  Isolation 

A  member  that  perceives  another  member  as  faulty  ceases  all  communication 
with  that  faulty  member.  This  leads  to  the  member  perceived  as  faulty  also  determining 
that  the  other  member  is  faulty,  since  no  communications  are  received. 

4.  Gossip 

A  member  that  isolates  another  member  gossips  about  the  isolation  in  the 
subsequent  communication  it  has  with  every  other  group  member.  Thus,  in  the  absence  of 
any  other  failures,  a  multicast  following  an  isolation  leads  to  the  whole  group  isolating  the 
member  that  was  perceived  faulty  by  the  sender  of  the  multicast. 
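
Isolation and gossip together can be sketched as a small in-memory model. In the Python illustration below, the `Member` class, its method names, and the dictionary message format are hypothetical; the sketch assumes that every outgoing message carries the sender's full isolation set and does not model message loss.

```python
class Member:
    """Toy group member that tracks which members it has isolated."""

    def __init__(self, name):
        self.name = name
        self.isolated = set()   # members this member has ceased communicating with

    def isolate(self, other_name):
        self.isolated.add(other_name)

    def make_message(self, payload):
        # Every outgoing message carries gossip about the sender's isolations.
        return {"sender": self.name, "payload": payload,
                "gossip": frozenset(self.isolated)}

    def receive(self, message):
        # Messages from an isolated member are ignored entirely.
        if message["sender"] in self.isolated:
            return False
        # Adopt the sender's isolations; a member never isolates itself.
        self.isolated |= message["gossip"] - {self.name}
        return True
```

A single multicast from the member that detected a failure is enough, in the absence of other failures, for every receiver to isolate the same member.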

5.  Group  View 

This term refers to the ordered membership list maintained by each member $m_i$,
and is denoted as $\mathrm{View}_x(m_i)$, where $x$ denotes the view number.

a.  Definition 

The  group  view  at  a  member  is  the  set  of  members  that  are  believed  to  be 
part  of  the  group.  It  is  ordered  with  respect  to  the  seniority  of  members  in  the  group  and 
has  an  integer,  called  a  view  number,  associated  with  it. 

b.  Remarks 

Every  membership  change  alters  the  number  of  members  in  the  view  at  a 
member  and  leads  to  the  installation  of  a  new  view  identified  by  the  next  higher  view 
number. The number of members in the group may change by more than one in a single
view change. The rank of a member denotes its seniority in the group, with the most
senior member having rank 0. Identical views imply identical membership as well as ranks.

6.  Group  Partition 

Let  G  denote  the  set  of  all  possible  potential  and  current  members  of  a  group.  A 
partition  P  of  G  is  defined  below. 

a.  Definition 

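$P$ is a subset of the all members' set $G$, such that $\forall\, m_i, m_j \in P$, if
$\mathrm{View}_x(m_i)$ and $\mathrm{View}_x(m_j)$ are defined, then
$\mathrm{View}_x(m_i) = \mathrm{View}_x(m_j)$; $m_i, m_j \in \mathrm{View}_x(m_i) \cap \mathrm{View}_x(m_j)$,
and all members have the same rank in both views.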

b.  Remarks 

The view associated with partition $P$ is denoted $\mathrm{View}_P$, and the partition
containing $m_i$ is denoted $P(m_i)$. Thus, all members in a partition must have identical
views. However, it is possible that there exists a member outside the partition that is still
in every member's view for that partition. Such partitions are called unstable partitions.
The MS protocol treats such a partition as legal, and eventually removes the outside
member from the views of all members of the partition. When no such outside member
exists for a partition, the partition is called stable. Network and member failures lead to
the creation of group partitions in asynchronous environments.
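
The distinction between stable and unstable partitions can be checked mechanically. The Python sketch below is an illustration only: the function name `classify_partition` and the representation of a view as a tuple of member names ordered by rank are assumptions, and the check relies on the remark above that all members of a partition hold identical views.

```python
def classify_partition(partition, views):
    """Classify a candidate partition as stable, unstable, or not a partition.

    partition: set of member names.
    views: dict mapping each member name to its view, a tuple ordered by rank.
    """
    if not partition:
        raise ValueError("a partition must contain at least one member")
    member_views = [views[m] for m in partition]
    common = member_views[0]
    if any(v != common for v in member_views):
        return "not a partition"   # membership or ranks differ between members
    if not set(partition) <= set(common):
        return "not a partition"   # some member is missing from the shared view
    # Any member still in the shared view but outside the partition makes it unstable.
    outsiders = set(common) - set(partition)
    return "unstable" if outsiders else "stable"
```

Comparing the views as ordered tuples means that two views with the same members but different ranks are treated as different, consistent with the statement that identical views imply identical ranks.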

7.  Group  Membership  Protocol 

Using  the  definitions  of  the  terms  above,  a  protocol  is  defined  to  solve  the  Group 
Membership  Problem  (GMP)  as  below. 

a. Definition

A protocol solves the GMP correctly if every change event results in group
partitions eventually.

b. Remarks

The above definition of a correct solution of the GMP requires it to satisfy
distinct properties corresponding to the conditions emphasized in the definition above:
every, eventually, and group partitions.

•  E1  This property, arising from the condition of every, requires that a
change  event  observed  by  a  member  is  processed  despite  other  virtually  simultaneous 
change  events  and  failures  during  protocol  execution,  including  that  of  the  coordinator. 
The  only  situation  in  which  a  change  event  is  not  processed  is  in  case  of  catastrophic 
occurrences  in  which  all  the  members  with  knowledge  of  the  change  event  suffer  real 
failures. 

•  E2  This  property,  arising  from  the  condition  of  eventually,  permits  the 
processing  of  a  change  event  to  be  suspended  temporarily;  however,  it  requires  that  the 
resulting  view  is  always  installed  at  all  members  of  the  partition  before  the  change  event 
occurred, after only a finite number of changes are allowed to take place.

•  GP  This  property,  arising  from  the  condition  of  group  partitions,  implies 
identical  views  at  all  members  of  each  partition.  As  per  the  protocol  described,  all 
partitions  resulting  from  change  processing  always  become  stable. 

Requirements imposed by the E1 and E2 properties satisfy the condition
commonly  known  as  liveness  in  distributed  systems  and  those  imposed  by  the  GP 
property  satisfy  safety  [5,  27,  28].  Thus,  the  uniqueness  of  views  and  identical  ordering  of 
changes  at  all  operational  members  is  guaranteed  by  GP. 

C.  REMARKS  ON  THE  PROTOCOL  STRUCTURE 

The  previous  chapter  described,  in  detail,  how  the  protocol  handles  various  change 
events.  The  functions  of  the  different  components  of  the  protocol  are  summarized  in  the 
following  paragraph.  Unless  specified  otherwise,  the  term  failure  is  assumed  to  imply  a 
perceived  member  failure  that  may  have  been  caused  by  either  a  network  failure  or  a 
member's  failure. 


Any  of  the  members  may  initiate  a  change  when  it  perceives  a  change  according  to  the 
change  events  described  earlier.  The  change  initiator  is  called  the  coordinator  for  that 
change  and  carries  out  the  basic  membership  change  protocol  listed  in  Figure  17.  The 
normal  two-phase  change  processing  is  illustrated  in  Figure  16.  The  first  phase  consists  of 
a  multicast  of  the  Initiate  message  to  all  the  members  followed  by  collection  of  ACKs 
from  all  members.  As  specified  in  Figure  20,  the  coordinator  collects  ACKs  from  all 
members  it  believes  to  be  in  the  group  while,  at  the  same  time,  trying  repeatedly  to  send 
the  Initiate  message  to  those  that  it  believes  to  be  present  but  from  whom  a  response  is 
not  forthcoming  due  to  a  failure.  The  second  phase  consists  of  multicasting  the  Commit 
message.  The  members  that  do  not  send  an  ACK  are  isolated  and  gossiped  about  during 
the  commit  phase. 

The  non-coordinator's  actions  of  Figure  18  consist  of  sending  the  ACK  message  and 
committing  the  change.  Once  the  Initiate  message  is  received,  the  receiving  mserver 
prompts  the  coordinator  repeatedly  if  a  Commit  message  is  not  received,  as  specified  in 
Figure 15. If a Commit message is not received due to a failure, the mserver expecting the
message  starts  a  broadcast  election.  As  specified  in  Figure  30,  all  of  the  members  that 
have  received  but  not  yet  committed  the  incomplete  change  elect  the  highest  rank  member 
as  the  coordinator.  The  elected  coordinator  then  resumes  processing  of  this  change  as 
specified in Figure 32. If the coordinator failed before any member could
commit the change, it is resumed with an Initiate multicast by the elected coordinator. If
at  least  one  member  that  participates  in  the  election  had  committed  the  change,  then  the 
newly  elected  coordinator  resumes  the  change  by  sending  a  Commit  message. 
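
The election outcome and the choice of how to resume the change can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedures of Figures 30 and 32: the function name elect_and_resume is hypothetical, and ranks are assumed to be integers in which a larger value means a higher rank.

```python
def elect_and_resume(participants):
    """Pick the new coordinator and the message that resumes the change.

    participants: list of (rank, has_committed) pairs, one per member
        taking part in the election for the incomplete change.
    Returns (coordinator_rank, next_message).
    """
    if not participants:
        raise ValueError("no member with knowledge of the change survives")
    # The highest rank participant becomes the coordinator.
    coordinator = max(rank for rank, _ in participants)
    # If any participant already committed, finish with a Commit;
    # otherwise the change is restarted with an Initiate multicast.
    if any(committed for _, committed in participants):
        return coordinator, "Commit"
    return coordinator, "Initiate"
```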

Due  to  the  possibility  of  other  changes  occurring  during  a  change  processing,  both  the 
coordinator  and  non-coordinators  must  take  additional  actions  as  specified  in  Figures  25 
and  24,  respectively.  In  Figure  25,  the  specification  of  Figure  20  is  augmented  to  permit 
the  coordinator  to  handle  messages  in  addition  to  the  ACKs  for  the  initiated  change. 
Depending upon the message received by the coordinator as it collects the ACKs, it
switches to a higher priority change, enters an election, or delays the change it is
coordinating  due  to  a  previous  change  that  may  not  yet  have  completed. 

Similarly, Figure 24 is the augmented version of Figure 15 to handle situations in
which  the  non-coordinator  does  not  get  the  expected  Commit  or  Direct  message.  The 
additional  actions  permit  the  non-coordinator  to  either  switch  to  a  higher  priority  change, 
start  an  election  if  the  coordinator  has  failed,  or  delay  another  coordinator  that  attempts  to 
install  the  next  view  change. 

D.  CORRECTNESS  ARGUMENTS 

Based  on  the  protocol  summary  above  and  the  detailed  description  given  in  the 
previous  chapter,  a  proof  is  presented  that  shows  that  the  MS  protocol  has  all  the 
properties  as  identified  above  for  a  correct  solution  to  the  GMP.  Also  shown  is  that  a 
more  refined  solution  to  the  GMP  defined  earlier  by  Ricciardi  and  Birman  [9]  is  possible. 

1.  Claim  1 

Change  event  processing  always  completes  at  both  the  coordinator  and  the 
non-coordinator  except  when  all  members,  including  the  coordinator,  with  knowledge  of 
the  change  fail. 

2.  Proof 

Consider  a  change  event  change(subject,  coordinator)  initiated  in  P. 

a.  At  the  coordinator 

Although  the  coordinator  makes  multiple  attempts  to  deliver  the  Initiate 
message  to  all  perceived  members  of  P,  it  does  not  require  a  predetermined  number  of 
them to respond before it sends a Commit message (line 5, Figure 17). If the coordinator
switches  to  a  higher  priority  change  before  it  sends  a  Commit  message,  the  information 
about  the  old  change  is  saved.  The  old  change  is  reinitiated  after  all  higher  priority 
changes complete.

b. At the non-coordinator

If  the  coordinator  fails,  at  least  one  member  times  out  on  the  Commit 
message  and  starts  an  election  (line  7,  Figure  18).  The  highest  rank  member  with  the 
change  active  is  elected  to  resume  the  change  (line  6,  Figure  30).  The  fact  that  the 
election  is  conducted  among  those  with  knowledge  of  the  change  ensures  that  the  change 
completes  even  if  the  coordinator  and  the  only  members  to  have  committed  the  change 
fail.  This  takes  care  of  the  invisible  commits  described  by  Ricciardi  and  Birman  [9]. 

3.  Claim  2 

In  any  partition,  either  only  one  change  event  proceeds  to  the  commit  phase,  or 
members  reaching  the  commit  phase  for  different  change  events  form  separate  partitions. 

4.  Proof 

Initially,  all  members  have  identical  views  of  the  membership  (definition  of  a 
partition).  In  the  set  of  all  potential  change  events,  there  exists  a  unique  priority  order  due 
to  the  uniqueness  of  ranks,  which  order  failures  and  departures,  and  network-level 
addresses,  which  order  joins.  This  permits  every  member  receiving  multiple  Initiate 
messages  before  receiving  any  Commit  message  to  switch  to  the  highest  priority  change 
that  will  install  the  next  view.  Overlapping  of  Initiate  messages  to  install  successive  views 
with different view numbers is not possible (line 6, Figure 25).

Suppose  a  member  receives  a  Commit  message  for  the  current  change  that  will 
change the view number from x to x+1. Suppose this mserver then receives a higher
priority  change  that  also  corresponds  to  a  view  number  change  from  x  to  x+ 1.  It  is 
guaranteed  that  the  sender  of  the  higher  priority  change  appears  in  the  gossip 
accompanying  the  received  Commit  message.  This  happens  because  the  coordinator  of 
the  lower  priority  change  will  have  timed  out  on  the  coordinator  for  the  higher  priority 
change  and  isolated  it  before  generating  a  Commit  message  (line  6,  Figure  17).  This 
ensures further partitioning if more than one change event proceeds to the commit phase.

5. Claim 3

If  the  coordinator  fails  after  sending  the  commit  message,  the  two-phase 
protocol  consisting  of  an  election  followed  by  a  commit  can  solve  the  group  membership 
problem  correctly. 

6.  Proof 

Begin  by  proving  the  contrapositive  statement: 

The  two-phase  protocol  consisting  of  an  election  followed  by  a  commit  cannot 
solve  the  GMP  correctly  if  the  coordinator  fails  before  sending  the  commit. 

If  the  coordinator  fails  before  sending  the  Commit  message,  it  is  possible  that  one 
of  the  members  has  not  yet  received  the  Initiate  message  for  the  change.  This  member 
would  respond  in  the  election  with  a  Coord-Fail  message  that  announces  that  it  is  not 
aware  of  the  change  for  which  the  election  has  been  started.  This  member  must  receive  an 
Initiate  message  before  it  can  commit  the  change  for  which  the  coordinator  failed.  If  the 
Coord-Fail  message  is  used  to  start  the  change  in  place  of  a  separate  Initiate  message,  and 
only  a  Commit  message  is  sent  to  complete  the  change,  then  the  GP  property  can  be 
violated,  as  shown  in  the  example  below. 

Consider a partition consisting of members m_1, m_2, m_3, C_a, and C_b. Let C_a initiate
change "a" by multicasting Initiate_a, which is received only by m_1 due to network failures.
C_a fails immediately after sending Initiate_a, and this failure is perceived by m_1, which then
starts an election by multicasting Coord-Fail_a. m_2 and m_3 participate in the election, but
C_b does not because it has failed. However, before failing, C_b starts another higher priority
change by multicasting Initiate_b, which reaches only m_2 due to network failures. Since
change "b" is a higher priority change, m_2 drops change "a" as the current change. At this
point, m_2 perceives that C_b has failed and starts an election by multicasting Coord-Fail_b.

Throughout this time, m_1 waits to hear C_b's response to the election for change
"a", which will not arrive due to C_b's failure before Coord-Fail_a reaches it. Eventually, m_1
times out in the election, determines that it must be the winner, and assumes the
responsibility for completing change "a". m_1 commits change "a" and multicasts Commit_a
to the group with gossip about C_b's isolation. If Commit_a reaches m_2 and m_3 after they
have switched to change "b" due to the Coord-Fail_b message, they will quietly discard the
Commit_a message due to its lower priority. Thus, m_1 will have committed change "a",
whereas m_2 and m_3 will never commit it. This inconsistency violates the GP property and
makes the two-phase protocol incorrect. Thus, the contrapositive statement is proved.

The  contrapositive  statement  proves  Claim  3  above.  It  should  be  noted  that  the 
failure  of  the  coordinator  after  sending  the  Commit  message  with  simultaneous  failures  of 
all members that receive the Commit message is equivalent to the coordinator failing
before  sending  the  Commit  message.  It  is  not  possible  to  differentiate  between  these  two 
situations,  thus  the  change  must  be  completed  in  three  phases.  In  the  protocol  described 
in  this  thesis,  the  three  phases  are  the  broadcast  election,  initiate,  and  commit  phases. 
Thus,  the  Resume_change  procedure  of  Figure  32  requires  the  elected  coordinator  to 
complete  the  change  with  a  Commit  message  if  some  member  that  had  committed  the 
change  participates  in  the  election,  permitting  a  two-phase  processing  of  the  coordinator 
failure.  Otherwise,  the  elected  coordinator  simply  reinitiates  the  change,  providing  a 
three-phase  processing  of  the  original  change. 

7.  Theorem 

The  proposed  group  membership  protocol  is  safe  and  live. 

8.  Proof 

The liveness properties follow directly from Claim 1. The safety property
follows from Claims 2 and 3.


VI.  MEMBERSHIP  SERVICE  IMPLEMENTATION 


This  chapter  describes  a  partial  implementation  of  the  MS  specified  in  previous 
chapters  on  a  campus-wide  set  of  LANs  with  UNIX-based  workstations.  The  use  and 
limitations  of  the  IP  multicast  capability  are  described,  as  well  as  the  needs  of  the  MS  not 
met  by  the  IP  multicast  capability.  To  meet  some  of  these  unfulfilled  multicasting  needs,  a 
multicast emulation program, called mcaster, was developed. The design and
implementation of this program are described. A complete set of utility functions for use
by the mcaster and MS programs was developed and is described in detail. High-level
descriptions  of  the  algorithms  used  to  implement  mservers  and  Mis  are  presented.  A 
working implementation of the shell of the mserver program is also presented. The
software code for the mcaster program, the utility functions, and the mserver shell
program is listed in the Appendix to this thesis.

A.  MULTICASTING 

The  use  of  multicast  message  delivery  is  essential  to  the  efficient  and  scalable 
operation  of  an  MS.  In  this  section  the  general  concept  of  multicast  message  delivery  is 
explained. Two implementations of multicast facilities are described: IP multicast and
a specially written multicast emulation program, called mcaster.

1.  IP  Multicast 

A  recent  addition  to  the  IP  suite  of  services  is  the  IP  multicast  capability.  A 
multicast  is  the  multipoint  delivery  of  a  single  datagram,  originated  by  a  single  sender  and 
delivered  to  multiple  destinations  which  are  part  of  a  predesignated  multicast  group.  This 
is  in  contrast  to  a  broadcast,  which  is  a  multipoint  delivery  of  a  single  datagram  to  all 
connected machines, without any capability to limit the scope of the delivery, and a
unicast, which is a point-to-point datagram delivery. In effect, a multicast is the
generalized form of message delivery, providing broadcasts at one extreme and unicasts at
the other [29]. Previously, the capability to multicast efficiently was limited to single
LANs, using the LAN hardware protocol. IP multicast provides a similar capability for
machines connected over the Internet, allowing the efficient multicast of a single datagram
to multiple receiving machines which are included in the multicast group, as shown in
Figure 48.


Figure  48:  IP  Multicast 


a.  IP  Multicast  Extensions 

Full  utilization  of  the  new  IP  multicasting  feature  requires  an  extension  to 
the  currently  installed  IP  implementation  on  each  host  machine.  The  document  which 
describes  how  this  extension  is  accomplished  [6]  defines  three  levels  of  conformance  to 
the  specification:  Level  0,  with  no  support  provided  for  IP  multicast  (the  current 
configuration for most machines), Level 1, which provides limited support for sending
multicasts but not for receiving multicasts, and Level 2, which provides full IP multicast
support. Level 2 requires the implementation of the Internet Group Management Protocol
(IGMP),  which  manages  the  dynamic  multicast  groups  which  a  host  must  join  to  receive 
multicast  datagrams.  A  depiction  of  the  layered  model  for  IP  multicast  is  shown  in  Figure 
49,  provided  by  reference  [6]. 


[Figure 49 content, top to bottom: Upper-Layer Protocol Modules; IP Service Interface; IP Module, containing ICMP and IGMP; Local Network Service Interface; Local Network Modules (e.g., Ethernet) alongside IP-to-local address mapping (e.g., ARP)]

Figure  49:  IP  Multicast  Layered  Model 

Full  use  of  IP  multicast  of  datagrams  requires  that  hosts  join  a  dynamic 
multicast  group.  This  group  is  identified  exclusively  and  uniquely  by  the  32-bit  IP  address 
used  to  transmit  a  datagram  to  the  group.  A  set  of  IP  addresses  has  been  reserved 
specifically  for  IP  multicast.  These  are  referred  to  as  class  D  IP  addresses,  with  the  first 
four bits of the address set to '1110' [6]. The range of these class D addresses is from
224.0.0.0 to 239.255.255.255, using the common dotted decimal notation to specify IP
addresses. Addresses between 224.0.0.0 and 224.0.0.255 inclusive are reserved for
multicast routing and maintenance protocols [7], but all other class D addresses are
available for use, providing a total multicast address space of over 268 million addresses.
A few of these addresses are permanently assigned, but most are available for transient
multicast groups. Additionally, the IP multicast specification provides a time-to-live (ttl)
variable  associated  with  each  multicast,  controlling  the  transmission  scope  of  any  multicast 
datagram.  With  judicious  use  of  the  ttl  variable,  it  is  possible  to  use  virtually  any  class  D 
address  for  a  given  host  group  without  worrying  about  prior  assignment  of  that  class  D 
address. 
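
The class D test and the size of the address space can be checked with a few lines of code. The sketch below is illustrative only; the names is_class_d and CLASS_D_COUNT are introduced here and do not come from the MS implementation.

```python
import ipaddress

def is_class_d(addr):
    """True if the dotted decimal IPv4 address has its first four bits set to 1110."""
    value = int(ipaddress.IPv4Address(addr))
    return (value >> 28) == 0b1110

# Four bits are fixed by the class D prefix, leaving 28 bits free.
CLASS_D_COUNT = 2 ** 28
```

The count evaluates to 268,435,456, which matches the figure of over 268 million addresses given above.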

As  described  earlier,  full  level  2  conformance  requires  implementation  of  the 
IGMP  to  manage  these  multicast  groups.  As  shown  in  Figure  49,  IGMP  is  an  integral  part 
of  the  IP  protocol  layer  when  implemented  at  a  host  or  gateway.  IGMP  controls  the 
relationship  between  a  multicast  router  and  a  set  of  host  machines  participating  in  a 
multicast  group.  Multicast  routers  and  host  machines  use  IP  datagrams  to  communicate 
status  back  and  forth,  similar  to  the  Internet  Control  Message  Protocol  (ICMP),  which  is 
used  to  report  errors  and  provide  information  about  unexpected  circumstances  between 
gateways  and  host  machines  [29].  IGMP  provides  a  mechanism  for  hosts  to  dynamically 
join  and  leave  multicast  groups,  and  for  local  multicast  gateways  to  monitor  the  group 
membership  as  well  as  provide  correct  routing  of  multicast  datagrams.  Hosts  and  local 
gateways  use  IP  multicast  datagrams  for  all  IGMP  communications,  using  the  "all  hosts" 
reserved multicast address of 224.0.0.1, to conduct very efficient communication [6]. The
local  gateway  maintains  status  tables  to  record  local  group  membership  of  hosts.  It  also 
periodically  polls  all  connected  hosts  to  determine  if  they  are  still  part  of  the  specified 
groups.  In  this  manner,  a  very  efficient  management  of  IP  multicast  groups  is  performed. 

b.  IP  Multicast  Implementation 

The  most  common  implementation  of  multicast  applications  involves  the  use 
of  the  Berkeley  sockets  abstraction  provided  in  most  UNIX  environments  for  network 
I/O.  Sockets  are  a  generalization  of  a  UNIX  file  object,  and  provide  an  endpoint  for 
communications [29]. There are normally three types of communication used for various
applications: reliable stream delivery, using a SOCK_STREAM type socket,
connectionless datagram delivery, using a SOCK_DGRAM type socket, and a raw type of
communication, using the SOCK_RAW type socket. IP multicast supports only the
SOCK_DGRAM and SOCK_RAW types of sockets, and provides no support for
connected sockets. Additionally, there are several types of system calls for sending and
receiving datagrams, most of which are similar to the system calls for UNIX file I/O. IP
multicast supports only the sendto, sendmsg, recvfrom, and recvmsg system calls for
datagram transmission and reception [7]. The sendto and sendmsg datagram transmission
calls  require  the  destination  (multicast  or  unicast)  address  as  an  input  parameter.  The 
recvfrom  and  recvmsg  system  calls  extract  the  sender's  address  from  the  header  of  the 
incoming  datagram.  Together,  these  calls  provide  a  very  efficient  means  of  combined 
unicast  and  multicast  network  communications,  since  the  only  difference  between 
communicating  with  a  single  host  or  a  multicast  group  is  the  address  used,  and  this 
address  is  readily  extracted  in  exactly  the  proper  format  to  send  a  reply  to  the  sender  for 
either  a  multicast  or  unicast  transmission.  The  format  of  the  IP  address  is  contained  in  a  C 
programming language structure, called sockaddr_in, as shown in Figure 50, containing
the  address  family,  port  number  and  IP  address  for  the  particular  host. 
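
The layout of this 16-byte structure can be illustrated by packing it by hand. The sketch below assumes the classic layout of a 2-byte address family, a 2-byte port, a 4-byte IP address, and 8 unused zero bytes; some systems arrange the first two bytes differently, and the helper name pack_sockaddr_in is hypothetical.

```python
import socket
import struct

def pack_sockaddr_in(ip, port):
    """Build the 16-byte socket address for an IPv4 host (illustrative)."""
    return (
        struct.pack("=H", socket.AF_INET)  # address family, host byte order
        + struct.pack("!H", port)          # protocol port, network byte order
        + socket.inet_aton(ip)             # 32-bit IP address
        + bytes(8)                         # unused, set to zero
    )
```

The same packing serves a unicast host or a class D group address, which is why a reply can be sent to either kind of sender without converting the address.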


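+----------------+----------------+
| Address Family | Protocol Port  |
+----------------+----------------+
|           IP address            |
+---------------------------------+
|             Unused              |
+---------------------------------+
|           Unused (0)            |
+---------------------------------+

Figure 50: IP Socket Address Structure (sockaddr_in)

2. Mcaster Program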

IP  multicasting  is  a  relatively  new  innovation,  and  is  not  widely  available  at  this 
time. Due to the very limited implementation of level 2 conformance to the IP multicast
specification  on  most  current  computer  networks,  it  was  decided  to  develop  a  program 
that  would  emulate  the  IP  multicast  capability  for  the  currently  available  unicast 
environment.  The  goal  was  to  develop  a  program  that  would  emulate  the  services 
provided  by  IP  multicast  as  transparently  as  possible;  hopefully  to  the  extent  that  a  user  or 
application  program  would  not  need  to  be  concerned  with  which  environment  was  actually 
being  used.  This  involved  simulating  all  of  the  functionality  provided  by  IGMP  at  the  host 
and  gateway  level. 

a. Mcaster Design Decisions

The  overall  scheme  chosen  for  the  IP  multicast  emulator,  called  mcaster, 
was  to  have  a  "daemon"  process  running  at  a  well-known  site,  which  would  act  as  an 
intermediary  between  the  members  of  a  multicast  group,  providing  essentially  the  same 
services  as  those  provided  by  IGMP,  such  as  controlling  members  joining  and  leaving 
groups,  and  the  routing  of  multicast  datagrams  to  all  members  of  a  particular  group.  The 
primary  difference  between  an  IP  multicast  gateway  using  IGMP  and  the  mcaster  program 
is  that  mcaster  enjoys  none  of  the  hardware  support  that  a  router  would  include  - 
especially  the  ability  to  send  a  datagram  over  multiple  interfaces  at  once.  The  mcaster 
program  would  be  running  on  a  standard  host  computer,  probably  using  a  single  interface 
to  the  internetwork.  This  limitation  is  the  most  significant  difference  between  an  IP 
multicast  router  and  an  mcaster  host  computer;  whereas  a  router  can  send  the  same 
datagram  to  multiple  recipients  simultaneously  over  multiple  network  interfaces,  the 
mcaster  must  iteratively  send  the  datagram  over  one  interface,  causing  a  significant 
performance degradation over IP multicast. However, the primary goal of the mcaster
emulator was to provide the capability of multicasting, not to match the performance
possible  through  hardware  supported  multicasting. 

The  primary  reason  for  developing  the  mcaster  program  was  to  provide  a 
multicast  capability  for  use  by  the  membership  service  under  development  in  environments 
which  did  not  support  IP  multicast.  For  this  reason,  the  message  format  used  by  the 
mcaster  program  was  chosen  to  correspond  as  closely  as  possible  to  the  expected  needs  of 
the  membership  service  that  it  would  support.  The  basic  message  format  for  the  mcaster 
program was designed to also be the basic message format for the MS. This message
format was previously described in Figure 13 of Chapter IV. Special message types are
reserved  for  mcaster  control  messages.  Although  the  mcaster  program  was  developed  to 
support  the  MS,  it  also  provides  a  general  multicast  capability  for  any  program  or  user. 
The  only  requirement  for  the  use  of  the  mcaster  program  is  that  messages  sent  by  the 
application  program  using  mcaster  must  include  a  header  structure  in  the  format  described 
above.  The  mcaster  program  will  then  be  able  to  deliver  messages  of  any  type  to  a 
designated  multicast  group. 

To  make  the  mcaster  daemon  as  capable  as  possible,  it  was  decided  to 
permit  each  mcaster  daemon  to  support  any  number  of  separate  groups,  each  with  an 
unlimited  number  of  members.  The  primary  data  structure  chosen  to  store  state 
information  for  all  groups  supported  by  an  mcaster  was  a  list  of  groups,  each  with  a  list  of 
members, as shown in Figure 51. Groups and their members are dynamically added to and
removed  from  the  lists  as  needed. 

A  host  computer  desiring  to  join  an  mcaster  multicast  group  simply  formats 
a message with the JOIN_GROUP message type and sends it to the well-known IP
address of the mcaster. The mcaster processes the join request and responds with a
similar message indicating success or failure of the join request. Leaving an mcaster
multicast group is done in exactly the same manner, with the message type set to
LEAVE_GROUP. Any message received by the mcaster which is not a join or leave
request is considered to be a message to multicast to the group, and is iteratively sent to
each member of the indicated group using the sendto socket system call. Whereas IP
multicast groups are exclusively and uniquely identified by their class D IP address,
mcaster multicast groups are identified by the combination of a group name and an
mcaster IP address.
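
The join, leave, and multicast handling described above can be sketched as a single dispatch routine over a list of groups. This is a simplified illustration rather than the mcaster code listed in the Appendix: the function name handle_message, the reply payloads (JOIN_OK, LEAVE_OK, LEAVE_FAILED), and the use of a dictionary of member lists are assumptions made for the example.

```python
JOIN_GROUP = "JOIN_GROUP"
LEAVE_GROUP = "LEAVE_GROUP"

def handle_message(groups, msg_type, group_name, sender, payload=b""):
    """Process one received message and return the datagrams to send.

    groups: dict mapping group name -> list of member addresses.
    Returns a list of (destination, data) pairs; a daemon would pass
    each pair to sendto on its single socket.
    """
    if msg_type == JOIN_GROUP:
        members = groups.setdefault(group_name, [])  # create group on demand
        if sender not in members:
            members.append(sender)
        return [(sender, b"JOIN_OK")]
    if msg_type == LEAVE_GROUP:
        members = groups.get(group_name, [])
        if sender not in members:
            return [(sender, b"LEAVE_FAILED")]
        members.remove(sender)
        if not members:
            del groups[group_name]  # drop groups that become empty
        return [(sender, b"LEAVE_OK")]
    # Any other message type is multicast: one unicast copy per member.
    return [(member, payload) for member in groups.get(group_name, [])]
```

The final line makes the cost of emulation visible: a multicast to a group of n members becomes n separate unicast transmissions.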


Figure 51: Mcaster Data Structures


b.  Differences  from  IP  Multicast 

Originally,  it  was  hoped  that  the  use  of  the  mcaster  multicast  emulator 
would  be  completely  transparent  to  a  user  or  application  program;  that  is,  exactly  the  same 
system calls would be made with nearly identical arguments for either multicast
environment, with identical results, in a manner similar to that shown in Figure 52. It was
soon  realized  that  there  were  several  deviations  from  the  desired  transparency  that  would 
be  necessary  to  make  the  mcaster  program  as  capable  as  desired. 


Figure 52: IP Unicast, Multicast, and Mcaster Using Separate Sockets


The  first  deviation  had  to  do  with  the  ability  of  an  mcaster  to  manage  more 
than  one  group.  Whereas  IP  multicast  groups  are  exclusively  and  uniquely  identified  by 
their  class  D  IP  address,  mcaster  multicast  groups  are  identified  by  the  combination  of  a 
group  name  and  an  mcaster  IP  address.  Since  an  mcaster  is  a  daemon  process  running  on 
a  specific  host  machine  with  a  unique  IP  address,  all  of  the  groups  managed  by  that 
mcaster  must  share  the  same  group  IP  address  of  the  host  machine.  This  is  in  contrast  to 
IP multicast groups, which may share a common local IP multicast router, but each still
has a unique IP address. The only implication of this deviation is that the group name
had  to  be  included  in  the  message  itself,  so  that  the  mcaster  could  extract  the  group  name 
and  reference  the  desired  group.  With  IP  multicasts,  the  group  name  would  not  be 
required,  since  the  identity  of  the  group  is  implicit  in  the  unique  group  address. 

The  second  deviation  from  the  desired  transparency  between  IP  multicast 
and  mcaster  multicast  had  to  do  with  the  procedure  for  joining  and  leaving  groups.  This 
deviation  was  inherently  necessary  due  to  the  fact  that  mcaster  emulates  the  functionality 
of  IGMP,  so  a  mechanism  had  to  be  created  to  perform  the  same  functions.  IP  multicast 
uses the setsockopt system call to make a socket multicast capable. The sockaddr_in
address structure bound to the socket is first loaded with the class D address of the group.
The setsockopt call is then made with the IP_ADD_MEMBERSHIP option set [7]. If the
address  used  is  a  legitimate  class  D  address,  then  IGMP  adds  the  calling  host  to  the 
specified  multicast  group.  Hosts  leaving  a  multicast  group  perform  the  same  routine,  with 
the  setsockopt  option  set  to  IP_DROP_MEMBERSHIP.  As  described  earlier,  a  host 
desiring  to  join  or  leave  an  mcaster  group  would  simply  format  a  message  with  the 
appropriate  message  type  and  send  it  to  the  host  running  the  mcaster.  The  functionality 
required  to  join  either  an  IP  multicast  or  mcaster  group  can  easily  be  encapsulated  within  a 
single  procedure,  perhaps  in  a  library  file,  giving  the  desired  transparency  between  the  two 
methods  of  multicasting  at  the  procedure  call  level.  The  same  holds  true  for  the  procedure 
to  leave  either  type  of  group. 
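
A single join procedure of this kind might look like the sketch below. It is an assumption-laden illustration, not the library routine from the Appendix: the function name join_group and the text form of the mcaster join request are hypothetical. The IP multicast branch passes the standard ip_mreq argument (group address followed by local interface address) to the IP_ADD_MEMBERSHIP option.

```python
import socket
import struct

def join_group(sock, group_name, address, port):
    """Join a multicast group by whichever method the address implies.

    A class D address selects IP multicast (setsockopt); any other
    address is treated as the unicast address of an mcaster daemon.
    Returns the method used, for illustration.
    """
    packed = socket.inet_aton(address)
    if packed[0] >> 4 == 0b1110:
        # ip_mreq: group address, then local interface (0.0.0.0 lets
        # the system choose the interface).
        mreq = struct.pack("4s4s", packed, socket.inet_aton("0.0.0.0"))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        return "ip_multicast"
    # Hypothetical wire format: message type and group name as text.
    sock.sendto(b"JOIN_GROUP " + group_name.encode(), (address, port))
    return "mcaster"
```

The caller supplies the same arguments in both cases, so the choice of multicasting method is made entirely by the address.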

A  third  deviation  between  the  two  types  of  multicasting  did  not  directly 
affect  the  transparency,  but  could  have  adverse  effects  on  the  performance  of  the  mcaster 
program. Unlike IGMP, the mcaster performs no monitoring of group members once a
host has joined a group. The purpose of this monitoring in IGMP is to detect members no
longer participating in the group and drop them from the membership. It was decided that
this was unnecessary for the mcaster; the lack of a monitoring capability did not directly
affect the ability to multicast nor the desired transparency between the two types of
multicast, since the user would normally not be aware the monitoring was taking place at
all. Instead, it was left to the application program to correctly leave an mcaster group.
Failure  to  properly  leave  an  mcaster  group  would  burden  the  mcaster  daemon  with 
sending  extra  messages  to  hosts  no  longer  participating  in  the  group,  increasing  the  time 
required  to  multicast  to  other  legitimate  hosts  in  the  group,  as  well  as  the  overhead 
required  to  store  the  state  of  members  no  longer  participating  in  the  group.  The 
functionality  required  to  monitor  the  status  of  group  members,  to  detect  non-participation, 
and  to  remove  non-participating  members  could  be  added  to  the  mcaster  program  at 
some  future  time  if  desired,  but  would  likely  affect  the  transparency  of  the  mcaster 
program  to  application  programs. 

The  final  deviation  in  the  transparency  between  the  two  types  of  multicasting 
was  the  most  significant.  Due  to  the  sender's  IP  address  being  included  in  the  datagram 
header, the receiving host can easily extract the sender's address using the recvfrom
system call. Normally this is a very desirable trait, useful for quick and easy replies to the
sender of a datagram. The problem encountered was that the mcaster program acts as an
intermediary for all multicasts between group members, extracting the group name from
the message to reference the proper group, and then sending the original message to all
members. In so doing, the sender's address in the datagram header is changed to the
mcaster host instead of the original sender. It was therefore no longer possible for a
receiving host to extract the original sender's address from the datagram header; instead
only the mcaster's address could be recovered. To remedy the inability of a receiving host
to determine the original sender of an mcaster multicast, it was required to prepend the
original sender's address to a normal message structure before it was encapsulated in an IP
datagram  and  sent  to  all  members.  An  illustration  of  the  extended  message  format  is 
shown in Figure 53. There were two choices as to how to handle the extended format
message at the receiving hosts. The first choice was to check every message and decide
whether it was a normal or extended format message, and process it accordingly. The
second choice was to have two separate receiving sockets: one for normal messages and
one for the extended format messages from an mcaster multicast. The latter method was
chosen for the simplicity and clean separation it provided between the two multicasting
methods, as shown in Figure 54. The drawback to the chosen method was that an
application program using an mcaster multicast would have to manage an extra socket at
all levels of the program, virtually eliminating the desired transparency. However, the
amount of overhead required to manage the extra socket is insignificant, and the use of the
extra receiving socket could easily be hidden in a separate receive routine in a library file,
similar to the join and leave procedures used to hide the access to the two multicasting
methods.

Figure 53: Extended Format Mcaster Message Structure
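
The idea of prepending the original sender's address can be sketched as a pair of helper routines. The 6-byte header used here (a 4-byte IPv4 address followed by a 2-byte port) is an assumption chosen for brevity and is not the layout of Figure 53; the names extend_message and split_extended are likewise hypothetical.

```python
import socket
import struct

# Assumed header: IPv4 address and port in network byte order, 6 bytes.
SENDER_HEADER = struct.Struct("!4sH")

def extend_message(message, sender):
    """Prepend the original sender's (ip, port) to a normal message."""
    ip, port = sender
    return SENDER_HEADER.pack(socket.inet_aton(ip), port) + message

def split_extended(datagram):
    """Recover the original sender and the normal message from a datagram."""
    packed_ip, port = SENDER_HEADER.unpack_from(datagram)
    sender = (socket.inet_ntoa(packed_ip), port)
    return sender, datagram[SENDER_HEADER.size:]
```

A receive routine in a library file could call split_extended on datagrams arriving at the mcaster socket and return the recovered sender in place of the address reported by recvfrom.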

The  deviations  noted  above  prevent  the  user  from  being  totally  uiuware  of 
which  method  of  muhicasting  is  being  used:  an  IP  multicast  or  an  mcaster  multicast. 

Figure 54: Multicasting Using Extended Format Mcaster Messages

However, it would be impossible to completely remove the awareness of the multicasting
method used, since an IP multicast only works within a limited range of IP addresses, and
the user would have to select the proper IP address to use if intent on using IP
multicasting. The deviations from IP multicasting listed above required by the mcaster
program would not be evident in the normal multicasting of IP datagrams; the user could
confidently select an IP address and name for the multicast group and then use the library
calls described to join the group, send and receive multicast and unicast messages, and
leave the group, without ever being aware of which method of multicasting was being
used. Thus, the desired level of transparency in multicasting methods was achieved.

c.  Mcaster  Algorithm 

The algorithm for the mcaster program is listed in Figure 55 and described
as follows. Line 1 is the initialization of the single socket used by the mcaster program for
all I/O. Line 2 begins the main loop of the program, an infinite loop of waiting to receive
a message, then processing the received message and sending a reply or multicast as
necessary. Lines 2 and 3 describe the process of blocking to receive an incoming message.
Lines 4 and 5 check the received message type for a join request. Lines 5 through 12
perform the join_group sequence. In line 6, the group list is searched to determine if the
group already exists or not. If the group is not located in the group list, lines 7 and 8 add
the new group to the list. If the group does exist, then lines 9 and 10 determine if the


Mcaster
/* Emulates IP Multicast using iterative unicasts */

 1.  initialize socket (group address)
 2.  wait for incoming messages
 3.  when message received
 4.      if (message type = JOIN_GROUP or LEAVE_GROUP)
 5.          if (JOIN_GROUP)
 6.              search group list for group (group name)
 7.              if (group not located)
 8.                  add group to group list
 9.              else /* group located */
10.                  search member list for member (member address)
11.              if (not already a member)
12.                  add member to member list
13.          else /* LEAVE_GROUP */
14.              locate group (group name) or indicate error
15.              locate member (member address) or indicate error
16.              if (located)
17.                  remove member from member list
18.                  if (member list is empty)
19.                      remove group from group list
20.          form ACK message
21.          send ACK message to requesting member
22.      else /* multicast to all members */
23.          for (all members in specified group)
24.              if (not sender or loopback)
25.                  send message to member
26.  goto line 2

Figure 55: Mcaster Algorithm

member is already in the member list of that group. If it is a new group or if the member
is not already in the member list, then the member is added to the member list of the
specified group in lines 11 and 12. Lines 13 through 19 perform the similar procedure for
leaving a group. Lines 14 and 15 locate the specified member. Lines 16 and 17 remove the
member from the member list. If the member list for the specified group is now empty,
lines 18 and 19 remove the group from the group list. Lines 20 and 21 complete the join
or leave sequence by forming and sending an acknowledgment message to the requesting
member. Lines 22 through 25 perform the multicast of any message other than a join or
leave request. Line 24 ensures that the sender does not receive the multicast message if
the no loopback option is selected. Line 26 completes the main loop and returns to line 2
to begin again.
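
The join, leave, and multicast branches of Figure 55 can be sketched in C as follows. This
is an illustrative in-memory version only, not the code of mcaster.c: the fixed table
sizes, the struct layout, and the function names join_group, leave_group, and
multicast_targets are assumptions, and the mapping of outcomes to the reply codes of
mcaster.h (including the spelling NEG_JOIN) is inferred from the names of those codes.

```c
#include <assert.h>
#include <string.h>

#define JOIN_ACK   130   /* reply codes as listed in mcaster.h */
#define DUP_MEMBER 131
#define NEG_JOIN   132   /* assumed spelling of the 132 code */
#define LEAVE_ACK  140
#define NO_GROUP   141
#define NO_MEMBER  142

#define MAXGROUPNAME 32  /* from msutil.h */
#define MAX_GROUPS   16  /* assumed table sizes for this sketch */
#define MAX_MEMBERS  16

struct group {
    char name[MAXGROUPNAME];
    unsigned long members[MAX_MEMBERS];   /* member IP addresses */
    int count;
};

static struct group groups[MAX_GROUPS];
static int ngroups;

static struct group *find_group(const char *name)
{
    for (int i = 0; i < ngroups; i++)
        if (strcmp(groups[i].name, name) == 0)
            return &groups[i];
    return NULL;
}

static int find_member(const struct group *g, unsigned long addr)
{
    for (int i = 0; i < g->count; i++)
        if (g->members[i] == addr)
            return i;
    return -1;
}

/* Lines 5-12: create the group if needed, then add the member. */
int join_group(const char *name, unsigned long addr)
{
    assert(name != NULL);
    if (strlen(name) >= MAXGROUPNAME)
        return NEG_JOIN;
    struct group *g = find_group(name);
    if (g == NULL) {
        if (ngroups == MAX_GROUPS)
            return NEG_JOIN;
        g = &groups[ngroups++];
        strcpy(g->name, name);
        g->count = 0;
    } else if (find_member(g, addr) >= 0) {
        return DUP_MEMBER;
    }
    if (g->count == MAX_MEMBERS)
        return NEG_JOIN;
    g->members[g->count++] = addr;
    return JOIN_ACK;
}

/* Lines 13-19: remove the member, and the group once its list is empty. */
int leave_group(const char *name, unsigned long addr)
{
    assert(name != NULL);
    struct group *g = find_group(name);
    if (g == NULL)
        return NO_GROUP;
    int i = find_member(g, addr);
    if (i < 0)
        return NO_MEMBER;
    g->members[i] = g->members[--g->count];   /* fill the gap with the last member */
    if (g->count == 0)
        *g = groups[--ngroups];               /* fill the gap with the last group */
    return LEAVE_ACK;
}

/* Lines 22-25: collect unicast destinations, skipping the sender unless
 * loopback is requested. Returns the number of addresses written to out. */
int multicast_targets(const char *name, unsigned long sender, int loop,
                      unsigned long *out, int out_cap)
{
    const struct group *g = find_group(name);
    int n = 0;
    if (g == NULL)
        return 0;
    for (int i = 0; i < g->count && n < out_cap; i++)
        if (loop || g->members[i] != sender)
            out[n++] = g->members[i];
    return n;
}
```

In the real program each address returned by the last routine would be the destination of
one unicast send, which is the iterative emulation of IP multicast named in the figure.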

The actual code for the mcaster program is included in the Appendix in the
program file mcaster.c. The utility functions used by the mcaster program are included in
the library files mcaster.h, msutil.h, and msutil.c, also included in the Appendix.

B.  MSERVER 

The functioning of an mserver process has already been explained from a procedural
point of view. The monitoring and change-processing protocols defined in Chapter IV
each explain the sequence of actions performed by an mserver with respect to one aspect
of the overall operation of the MS and an mserver. The protocols are described in a
procedural form, implying that an mserver performs the complete set of actions that make
up each protocol sequentially before beginning a new set of actions. In reality, each
mserver must continually process incoming messages and changes to its internal state
concurrently. It is true that for strong membership consistency guarantees,
only one change will be committed by a core-set of mservers at a time; however, during
the processing of that change, many other events must be registered and processed. These
other events include the reception of messages of all types: some that affect the current
change, others that do not but must be stored nonetheless, and some that require an
immediate response, such as a monitoring query. Other events include the expiration of
timers or a change in the internal state caused by processing the current change.

Simply put, an mserver process performs three primary actions: 1) it receives and
stores incoming messages, 2) it changes the internal state in response to internal events or
incoming messages, and 3) it sends outgoing messages. The incoming and outgoing
messages may be unicast or multicast, depending on the circumstances. In this section, the
operation of an mserver process is described in detail from an implementation viewpoint.

1.  Internal  State  and  Data  Structures 

Each  mserver  process  has  a  dual  personality;  it  is  a  member  of  a  core-set  of  peer 
mservers,  as  well  as  the  parent  of  a  child-set  of  mservers.  For  this  reason,  the  set  of  data 
structures  and  variables  used  to  maintain  the  internal  state  of  an  mserver  must  be 
replicated  to  support  both  identities.  Additionally,  each  mserver  must  maintain 
information  about  all  application  groups  that  it  supports.  Figure  56  illustrates  the  data 
structures  and  variables  used  to  maintain  the  internal  state  of  an  mserver.  Each  of  these 
data  structures  will  be  described  in  detail  in  the  following  paragraphs.  Table  6  lists  the 
variables  used  to  maintain  the  mserver's  internal  state  and  their  meaning. 

As shown in Figure 56, an mserver maintains two core-tables: one for the peer
core-set, and one for the child-set. The core-tables are used to maintain the membership
information for each of the mserver's core-sets. The index into the table is the gid of each
mserver in the core-set. The IP address of each mserver is stored to allow unicast
message addressing. The rank of each mserver is maintained, with a rank of 0
corresponding to the highest rank and most senior mserver in the core-set. The cw and
ccw variables are integer pointers representing the clockwise and counterclockwise
neighbors of each mserver. These links represent the pairwise monitoring-set; the
clockwise neighbor is the monitor and the counterclockwise neighbor is the monitored
mserver. It is important that all mservers know the exact monitoring relationship of all
other core-set mservers, so that the correct monitoring arrangements can be made by all
each time the core-set membership changes. Figure 57 illustrates the structure of these
core-tables, and Figure 58 illustrates a core-set of mservers corresponding to the core-set
listed in the core-table of Figure 57.

Figure 56: Mserver Data Structures and Internal State
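
A core-table of this kind, and one way every mserver could derive the same monitoring ring
after a membership change, can be sketched as follows. The struct core_entry layout and
the rebuild_ring function are hypothetical, and the linking rule (members arranged in
ascending gid order, wrapping around) is an assumption made for illustration; the thesis
does not state the rule here, only that all mservers must arrive at the same arrangement.

```c
#include <assert.h>
#include <stddef.h>

#define MAXTBLSIZE 250            /* from msutil.h */

struct core_entry {               /* hypothetical layout, indexed by gid */
    int in_use;                   /* nonzero if this gid is a current member */
    unsigned long addr;           /* IP address for unicast messages */
    unsigned short rank;          /* 0 = highest rank, most senior */
    int cw;                       /* gid of clockwise neighbor (monitor) */
    int ccw;                      /* gid of counterclockwise neighbor (monitored) */
};

/* Relink the ring after a membership change. Because the rule depends only
 * on table contents, every mserver holding the same table computes the same
 * monitoring pairs. Returns the number of members linked. */
int rebuild_ring(struct core_entry *tbl, int size)
{
    int first = -1, prev = -1, count = 0;

    assert(tbl != NULL && size > 0 && size <= MAXTBLSIZE);
    for (int gid = 0; gid < size; gid++) {
        if (!tbl[gid].in_use)
            continue;
        if (first < 0)
            first = gid;
        if (prev >= 0) {
            tbl[prev].cw = gid;
            tbl[gid].ccw = prev;
        }
        prev = gid;
        count++;
    }
    if (count > 0) {              /* close the ring */
        tbl[prev].cw = first;
        tbl[first].ccw = prev;
    }
    return count;
}
```

With a single member the entry becomes its own neighbor in both directions, which matches
the case of an mserver that has just created a new core-set.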

Each mserver maintains four lists for each side of internal state: an mserver
failures list, a physical change requests list, an application group change requests list, and a
list of all application groups supported by that core-set. The failures list is a list of all
mservers that have been detected as failed by this mserver, but not yet processed out of the
core-set. The format of the list is the same as the exclude_list and subject_list shown in
Figures 13 and 53. The physical change requests list is shown in Figure 59. This list stores
all of the relevant change data for each physical change request received at an mserver.
The data is copied from the received message and a new entry is added to the list of
pending changes. The application group change requests list functions in the same
manner as the physical change requests list, but is maintained separately to simplify the
prioritization of pending physical and application change requests. The list of application

TABLE 6: MSERVER INTERNAL STATE VARIABLES
Note: Separate copies of all state variables are maintained by each mserver for
the core-set and child-set of which it is a member.

Variable          Description
----------------  ----------------------------------------------------------------
group_name        name of core-set
group_address     address of core-set
group_size        size of core-set
group_view        current group view of core-set
authentication    used for core-set security
mygid             group identity number of this mserver
cw                clockwise neighbor (monitor)
ccw               counterclockwise neighbor (monitored)
coretable         pointer to core-table
exclude_list      list of mservers to be excluded from core-set due to failure
subject_list      list of subjects for a Merge or Split, or failed mservers
current_change    pointer to structure for data about current change
previous_change   pointer to structure for data about previous change
failures          list of failed core-set mservers waiting to be processed
requests          list of pending physical change requests received by core-set
app_changes       list of pending application change requests submitted to mserver
timeouts          timeout vector (recv, query, reply, messg, ACK)
retries           retries vector (reply, messg, ACK)
expectedtype      message type expected for current processing
responses         count of number of responses (ACKs and Coord Fails)
app_group_list    list of application groups supported and relevant data

groups is illustrated in Figure 60. The fields in each entry in the application group list
represent all of the data that the mserver must maintain for each application group
supported. Keeping the stored data minimal preserves scalability. The
group_name is the string name for the application group. The core_set and name_set
fields are boolean variables to indicate whether this mserver is in the core-set or name-set

Figure 57: Mserver Core Table

Figure 58: Mserver Core-set Corresponding to Core Table in Figure 57

Figure 59: Mserver Requests List
(entry fields: group_view, sender_gid, msg_type, subject_gid, subject_addr, subject_rank,
subject_list)

Figure 60: Mserver Application Groups List

of the application. The members field is a list of child mservers with application members
in their hierarchy. Only these child mservers will be included in the message exchange and
change processing for the application group which they support. Other mservers will not
be impacted by the changes to application groups which they do not support, with the
exception of processing core-set changes if they happen to be in the core-set for the
application. Figure 61 shows the data structure used to store information related to the
current and previous changes. The current change structure maintains a separate exclude
list from that included with the mserver's internal state, so that any changes made to the
exclude list while processing a change that is subsequently dropped do not affect the main
exclude list of the mserver. When the change is committed, the main exclude list for the
core-set is updated with the new information contained in the current_change exclude list.
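
The effect of keeping a separate exclude list inside the current change can be shown with
a short sketch. The representation of an exclude list as one flag per gid, the struct
names, and the four functions are assumptions made for illustration; only the behavior
(a dropped change leaves the main list untouched, a committed change updates it and is
then saved as the previous change) is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

#define MAXTBLSIZE 250                 /* from msutil.h */

struct exclude_list {                  /* hypothetical: one flag per gid */
    unsigned char excluded[MAXTBLSIZE];
};

struct change {
    int active;                        /* nonzero while the change is in progress */
    struct exclude_list excl;          /* working copy, private to this change */
};

struct side_state {                    /* one side (peer or child) of an mserver */
    struct exclude_list excl;          /* main exclude list */
    struct change current_change;
    struct change previous_change;
};

/* Start a change: the working exclude list begins as a copy of the main one. */
void begin_change(struct side_state *s)
{
    assert(s != NULL && !s->current_change.active);
    s->current_change.excl = s->excl;
    s->current_change.active = 1;
}

/* Record an exclusion in the working copy only. */
void exclude_in_change(struct side_state *s, int gid)
{
    assert(s != NULL && s->current_change.active);
    assert(gid >= 0 && gid < MAXTBLSIZE);
    s->current_change.excl.excluded[gid] = 1;
}

/* Drop the change: the main exclude list is left as it was. */
void drop_change(struct side_state *s)
{
    assert(s != NULL);
    s->current_change.active = 0;
}

/* Commit the change: publish the working copy and keep it as previous_change. */
void commit_change(struct side_state *s)
{
    assert(s != NULL && s->current_change.active);
    s->excl = s->current_change.excl;
    s->current_change.active = 0;
    s->previous_change = s->current_change;
}
```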

2.  Algorithm  and  Explanation 

The  general  algorithm  for  an  mserver  is  listed  in  Figure  62.  As  described 
previously,  the  algorithm  for  the  mserver  allows  continual  processing  of  incoming 
messages  and  internal  events,  even  while  a  current  change  is  being  processed.  Outgoing 

Figure 61: Mserver Current and Previous Change Storage


messages  can  be  sent  at  the  same  time  as  well.  The  line-by-line  description  of  the 
algorithm  follows. 

The mserver is started with a command line call, as shown in the header of Figure
62. The group_address is only included if this mserver is the first to join a new core-set,
thus creating the core-set. Lines 1 and 2 make system calls to obtain information about
the host computer and core-set multicast group. These calls are made with the name as an


Mserver (group_name, [group_address])
/* group_name is the string name of the core-set, group_address is an optional IP
   address of the multicast group for the core-set, included only when a new
   core-set is being created */

/* Initialize mserver */
 1.  obtain_info (host_name, host_address)
 2.  obtain_info (group_name, group_address)
 3.  initialize sockets (ms, mc)
 4.  initialize (internal state)
 5.  set timers (recv, query, reply, messg, ACK)

/* Join core-set */
 6.  send_message (Join_message, group_address)
 7.  messg = Reliable_receive (Init_message, Msg_Query_message)
 8.  if (messg = Init_message)      /* successful join */
 9.      join_mcast_group (group_name, group_address)
10.      update (internal state)
11.  else                           /* unsuccessful join */
12.      exit and report error

/* Begin main processing loop */
13.  for (;;)                       /* infinite loop */
14.      receive_message (messg, recv_timeout)
15.      update (internal state)
16.      process_message (messg, internal state)
17.      process_lists (internal state)
18.      process_timeouts (internal state)

Figure 62: Mserver Algorithm

input argument, and return the respective address. The core-set multicast address is
obtained from a locally maintained, well-known file, mapping group names to multicast
addresses in the local environment, if the group already exists. If the group does not exist,
the procedure registers the group name and corresponding group address in the file.
Line 3 initializes the two sockets used by the mserver: ms for incoming unicast messages
and all outgoing messages, and mc for outgoing multicast messages (to utilize the mcaster
utility, if needed). Line 4 initializes the internal state of the mserver, represented by the
data structures and variables for each part, as shown in Figure 56. Line 5 initializes all
timeout variables used as timers for the reception of messages.

Now that the mserver has been created and initialized, lines 6 through 12 control
the mserver's attempt to join the desired core-set. If a new core-set is being created, there
is no need to send and receive messages to join a core-set. The mserver simply updates
the internal state to reflect that it is the only mserver in the new core-set. If the core-set
already exists, a join request message is sent to the core-set multicast address in line 6,
followed by a Reliable_receive of the Initial Parameters message from the core-set in line
7. The Initial Parameters message contains the complete state of the sending mserver,
which was the coordinator for the join request of this new mserver to the core-set. Since
the state of the coordinator is also the state of all other mservers in the core-set, the
joining mserver receives the complete state of the core-set in this message. An illustration
of the Initial Parameters message is shown in Figure 63. If the Initial Parameters message
is returned, the mserver joins the multicast group for the core-set. This is a separate
action from joining the core-set; the core-set must have been joined first before allowing
the new mserver to become a member of the core-set multicast group. If for some reason
the joining mserver is unable to join the multicast group, it will exit and return an error
code to the caller. The core-set will soon detect that the new mserver has failed and
remove it from the group automatically.

Figure 63: Initial Parameters Message Format

After successfully joining the core-set, the mserver begins the main loop of
operation, shown in lines 13 through 18. The mserver continually repeats a cycle of
receiving any incoming messages, processing the received message, then processing any
pending failures or requests, and finally checking the internal timers to determine if any
messages have exceeded their timeouts. In line 14 a timed receive function is used; the
process is idle waiting for any incoming message or the timeout period to elapse. This is
similar to a combination of the select and recvfrom UNIX socket calls. The timeout for
the receive function is a relatively short period, and in the absence of any incoming
messages, acts as the clock for the mserver. Each iteration of the main loop represents
one "tick" of the logical event clock for the mserver. All other timeouts used are multiples
of this basic receive timeout, so that messages are timed in terms of a real clock as well as
the logical event clock. When the receive function returns, either a message has been
received, or the timeout period has expired. If a message was received, it is processed in
line 16. The message is decoded, and the appropriate action taken depending on the
message type.
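
A timed receive of the kind described can be built from select and recvfrom roughly as
follows. This is a sketch under stated assumptions rather than the routine used in
mserver.c: the name timed_receive, its argument list, and its return convention
(1 = message received, 0 = timeout, -1 = error) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <sys/time.h>

/* Wait up to timeout_usec microseconds for one datagram on sock.
 * On success the datagram is stored in buf and its length in *len. */
int timed_receive(int sock, void *buf, size_t cap, long timeout_usec, size_t *len)
{
    fd_set readable;
    struct timeval tv;
    int ready;
    ssize_t n;

    assert(buf != NULL && len != NULL && timeout_usec >= 0);
    FD_ZERO(&readable);
    FD_SET(sock, &readable);
    tv.tv_sec = timeout_usec / 1000000L;
    tv.tv_usec = timeout_usec % 1000000L;

    ready = select(sock + 1, &readable, NULL, NULL, &tv);
    if (ready < 0)
        return -1;               /* select failed */
    if (ready == 0)
        return 0;                /* timeout: one idle tick of the event clock */

    n = recvfrom(sock, buf, cap, 0, NULL, NULL);
    if (n < 0)
        return -1;
    *len = (size_t)n;
    return 1;
}
```

Returning the timeout as a distinct value lets the main loop advance its logical clock on
every call, whether or not a message arrived.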

Next, the failures list, requests list, and application group requests list are
checked for pending items. The lists are checked in this order, so that the failures list has
priority. Any mserver failures queued are "batched" and processed as one change to the
core-set membership, with the rank of the highest ranking subject used for the change
priority. Upon completion of processing the failures list, the requests list and application
group requests list are checked, in that order. Only one pending change is processed
each main cycle; the request at the head of the selected queue is copied into the
current_change structure and processed as the current change. The request is not
removed from the head of the queue until the change is committed, so that if the change is
dropped, the request will remain at the head of the queue. Once the change is committed,
the change data is copied from the current_change structure to the previous_change
structure.
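
The selection order among the three lists can be sketched as below. The linked-list
representation and the names select_change and dequeue_committed are assumptions for
illustration. The sketch reflects three points from the text: failures are considered
first and batched into one change, the batch takes the rank of its highest ranking subject
(rank 0 being highest), and a request leaves its queue only after the change is committed.

```c
#include <assert.h>
#include <stddef.h>

enum source { SRC_NONE, SRC_FAILURES, SRC_REQUESTS, SRC_APP_CHANGES };

struct request {                   /* hypothetical queue entry */
    int subject_gid;
    unsigned short subject_rank;   /* 0 = highest rank */
    struct request *next;
};

struct lists {
    struct request *failures;      /* detected failures not yet processed */
    struct request *requests;      /* pending physical change requests */
    struct request *app_changes;   /* pending application change requests */
};

/* Choose the next change without dequeuing anything. For failures, the whole
 * list forms one batch and *priority is the best (numerically lowest) rank. */
enum source select_change(const struct lists *l, const struct request **head,
                          unsigned short *priority, int *batch_size)
{
    assert(l != NULL && head != NULL && priority != NULL && batch_size != NULL);
    *head = NULL;
    *priority = 0;
    *batch_size = 0;

    if (l->failures != NULL) {
        *head = l->failures;
        *priority = l->failures->subject_rank;
        for (const struct request *r = l->failures; r != NULL; r = r->next) {
            if (r->subject_rank < *priority)
                *priority = r->subject_rank;
            (*batch_size)++;
        }
        return SRC_FAILURES;
    }
    if (l->requests != NULL) {
        *head = l->requests;
        *priority = l->requests->subject_rank;
        *batch_size = 1;
        return SRC_REQUESTS;
    }
    if (l->app_changes != NULL) {
        *head = l->app_changes;
        *priority = l->app_changes->subject_rank;
        *batch_size = 1;
        return SRC_APP_CHANGES;
    }
    return SRC_NONE;
}

/* Called only after a commit: the request at the head leaves its queue. */
void dequeue_committed(struct request **queue)
{
    assert(queue != NULL && *queue != NULL);
    *queue = (*queue)->next;
}
```

Because select_change never modifies the queues, calling it again after a dropped change
returns the same head, which is the retry behavior described above.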

Finally,  all  timers  are  checked  to  see  if  any  expected  message  has  exceeded  the 
associated  timeout  period.  If  any  timers  expired,  the  internal  state  is  updated,  messages 
are  sent  as  needed,  and  the  timers  are  reset.  This  completes  the  main  loop  processing, 
which  is  then  repeated  continually.  The  code  for  a  partial  implementation  of  an  mserver 
process  is  included  in  the  Appendix  in  file  mserver.c. 
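
Since every timeout is a multiple of the basic receive timeout, timers can be kept as
tick counts of the main loop. The sketch below assumes this representation; struct timer
and the three functions are hypothetical, while the constants follow msutil.h (with
parentheses added around the macro bodies for safety).

```c
#include <assert.h>
#include <stddef.h>

#define SEC      1000000L        /* microseconds, as in msutil.h */
#define T_RECV   (1 * SEC)       /* receive cycle timeout: one tick */
#define T_REPLY  (5 * T_RECV)    /* incoming reply timeout */
#define T_QUERY  (60 * SEC)      /* timeout to send another query */
#define MAXTIME  0x7fffffffL     /* used to reset a timeout */

struct timer {
    long deadline_tick;          /* tick of the main loop at which it expires */
};

/* Arm a timer: convert a timeout in microseconds to a number of ticks. */
void timer_start(struct timer *t, long now_tick, long timeout_usec)
{
    assert(t != NULL && timeout_usec > 0 && timeout_usec % T_RECV == 0);
    t->deadline_tick = now_tick + timeout_usec / T_RECV;
}

/* Reset a timer so that it does not expire. */
void timer_reset(struct timer *t)
{
    assert(t != NULL);
    t->deadline_tick = MAXTIME;
}

/* Checked once per main loop iteration. */
int timer_expired(const struct timer *t, long now_tick)
{
    assert(t != NULL);
    return now_tick >= t->deadline_tick;
}
```

A tick counter incremented once per pass through the main loop then serves as both the
logical event clock and, because each idle pass lasts T_RECV, an approximate real clock.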

C.  MEMBER  INTERFACE 

As discussed previously, the primary purpose for the MI is to act as an interface
between the application and the MS hierarchy. The MI accepts membership change and
information requests from the application processes and relays the requests to the LAN
mservers for processing. When the change has been processed, the MI accepts and relays
the change data to the application processes. The MI must also respond to periodic
monitoring queries from the LAN mserver.

1.  Internal  State  and  Data  Structures 

The MI maintains a limited amount of information about the MS hierarchy and
the application process group members that it supports, as shown in Figure 64. The only
MS hierarchy information stored is the name and IP address of the LAN mserver. The MI
maintains a list of the application groups that it supports. This list is very similar to the
application groups list maintained by mservers, except the MI is not part of any core-set or
name-set. Additionally, the MI must maintain a list of all members running on the host
computer for each application group. Information about other member processes is
obtained by requests to the MS hierarchy. An optional QoS feature would allow the MI
to store limited information about all application member processes for a particular
application, thus increasing the storage requirements at the host computer, but greatly
reducing the latency to obtain member information.


Figure 64: MI Data Structures and Internal State
2.  Algorithm  and  Explanation 

Figure 65 lists the algorithm used by the MI. The algorithm is similar to that of
an mserver, but much simpler. The same idea of a continual cycle of receiving messages,
processing the messages, processing pending requests, and processing expired timeouts is
performed. The timed receive function is used again, so that receive cycles act as the
internal event clock. The initialization in lines 1 through 11 is very similar to that of an


MI (mserver_name)
/* mserver_name is the string name of the LAN mserver */

/* Initialize MI */
 1.  obtain_info (host_name, host_address)
 2.  obtain_info (mserver_name, mserver_address)
 3.  initialize socket (ms)
 4.  initialize (internal state)
 5.  set timers (recv, messg)

/* Register with LAN mserver */
 6.  send_message (Join_message, mserver_address)
 7.  messg = Reliable_receive (ACK_message, Msg_Query_message)
 8.  if (messg = ACK_message)       /* successful */
 9.      update (internal state)
10.  else                           /* unsuccessful */
11.      exit and report error

/* Begin main loop */
12.  for (;;)                       /* infinite loop */
13.      receive_message (messg, recv_timeout)
14.      update (internal state)
15.      process_message (messg, internal state)
17.      process_lists (internal state)
18.      process_timeouts (internal state)

Figure 65: MI Algorithm


mserver, except the MI does not join a group, but instead registers with the LAN mserver.
The main loop of lines 12 through 18 is nearly identical to that of an mserver, except that
there are many fewer events to process at each stage. The only messages received by an
MI are application membership change and information requests, Direct messages from
the LAN mserver, and monitoring Query messages from the LAN mserver. The only
messages that an MI sends are Submit messages for application change requests, Reply
monitoring messages to the LAN mserver, and configuration messages to the MS. The
MI only needs to track two timers: the main receive timer and a message timer for
expected responses. Incoming requests are added to a pending requests list if a current
request has been submitted to the LAN mserver. As each change request is completed
with the reception of a Direct message, a new request is taken from the head of the queue
and processed.
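
The one-request-at-a-time behavior of the MI can be sketched with a small FIFO. All names
here (struct mi_state, mi_request, mi_direct) and the fixed queue capacity are
assumptions for illustration; requests are represented only by an integer identifier.

```c
#include <assert.h>
#include <stddef.h>

#define MI_QUEUE_CAP 8                 /* assumed capacity for this sketch */

struct mi_state {
    int queue[MI_QUEUE_CAP];           /* pending request ids, oldest first */
    int head;                          /* index of the oldest pending request */
    int count;                         /* number of pending requests */
    int outstanding;                   /* nonzero while a Submit awaits its Direct */
    int current;                       /* id of the request that was submitted */
};

/* A new application request arrives. It is submitted at once if nothing is
 * outstanding, otherwise it is queued.
 * Returns 1 if submitted, 0 if queued, -1 if the queue is full. */
int mi_request(struct mi_state *m, int id)
{
    assert(m != NULL);
    if (!m->outstanding) {
        m->outstanding = 1;
        m->current = id;
        return 1;
    }
    if (m->count == MI_QUEUE_CAP)
        return -1;
    m->queue[(m->head + m->count) % MI_QUEUE_CAP] = id;
    m->count++;
    return 0;
}

/* A Direct message completes the current request; the next pending request,
 * if any, is taken from the head of the queue and submitted.
 * Returns the id of the newly submitted request, or -1 if the MI is now idle. */
int mi_direct(struct mi_state *m)
{
    assert(m != NULL && m->outstanding);
    if (m->count == 0) {
        m->outstanding = 0;
        return -1;
    }
    m->current = m->queue[m->head];
    m->head = (m->head + 1) % MI_QUEUE_CAP;
    m->count--;
    return m->current;
}
```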
VII.  CONCLUSIONS  AND  FUTURE  WORK 


A.  CONCLUSIONS 

This thesis presented a globally scalable, fully decentralized group membership service
which provides the framework for distributed applications of any size and distribution. A
complete description of the architectural design and protocol specifications was
presented, and an implementation of the membership service was described.

The  most  significant  contribution  of  the  group  membership  service  described  herein  is 
the  total  scalability.  The  MS  provides  a  consistent  suite  of  services  to  client  applications 
distributed  on  any  scale,  from  a  single  LAN  to  the  worldwide  internetwork.  No  other 
membership  protocol  or  service  provides  this  capability.  Other  noteworthy  contributions 
include  the  decentralized  and  efficient  nature  of  the  MS,  and  the  concept  of  providing 
numerous  user-selectable  Quality  of  Service  options  to  tailor  the  MS  to  the  needs  of  each 
client  application. 

B.  FUTURE  WORK 

Although a great deal of work was accomplished in the design and partial
implementation of the MS described in this thesis, much more work remains to be done.
First, to demonstrate that the MS is truly scalable to global proportions, a working
implementation of the complete MS must be developed and installed on progressively
larger scales. Second, complete performance analysis of the operation, overhead, network
constraints, service latency, and functionality of the MS must be accomplished. Third, a
complete formal specification of the protocols used by the MS must be accomplished, with
a reachability analysis conducted to demonstrate formally correct operation. It is
recommended that an extended communicating finite state machine (CFSM) modeling
method be used, such as Systems of Communicating Machines (SCM) [30], for the formal
specification. Finally, the MS must be modified to take advantage of the reliable,
high-speed networks which are currently being deployed. Advances in network
technology, such as ATM (Asynchronous Transfer Mode) and SONET (Synchronous Optical
Network), provide a different network model than the conventional IP-based model used
for the design of this MS. The conceptual design described in this thesis remains valid for
any network model; however, by adapting the protocol specifications to take advantage of
reliable, high-speed networks, a more efficient and capable version of the MS can be
realized.
APPENDIX


/************************************************************************
 * MCASTER.H ver 1.0
 * Header file for MCASTER.C
 * Program to emulate IP multicast in a unicast environment
 *
 * Written by Dave Neely, March 1994.
 * Modified: 4/23/94
 ************************************************************************/

#define MC_PORT       3000
#define JOIN_GROUP    120
#define LEAVE_GROUP   121
#define JOIN_ACK      130
#define DUP_MEMBER    131
#define NEG_JOIN      132
#define LEAVE_ACK     140
#define NO_GROUP      141
#define NO_MEMBER     142
#define NEG_LEAVE     143
#define NO_LOOP       0
#define LOOP          1

/************************************************************************
 * MSUTIL.H ver 1.0
 * Header file for Membership Service (MS) utilities
 *
 * Written by Dave Neely, March 1994.
 * Modified: 4/23/94
 ************************************************************************/

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <netinet/in.h>
#include <string.h>
#include <netdb.h>
#include <sys/time.h>

#define VERS          1            /* version number */
#define MS_PORT       2900         /* unicast & IP multicast */
#define MAXGROUPNAME  32           /* chars */
#define MAXMSGLEN     1024         /* 1kB */
#define SEC           1000000      /* 1 million usec */
#define T_RECV        1*SEC        /* recv cycle timeout */
#define T_REPLY       5*T_RECV     /* incoming reply timeout */
#define T_MESSG       T_REPLY      /* incoming messg timeout */
#define T_ACK         T_REPLY      /* incoming ACK timeout */
#define T_QUERY       60*SEC       /* timeout to send another query */
#define MAXTBLSIZE    250          /* max size of group table */
#define MAXTIME       0x7fffffff   /* to reset timeouts */

#ifdef IFF_MULTICAST
#ifndef MULTICAST
#define MULTICAST
#endif
#endif


/* monitoring message types */
#define QUERY          0     /* monitoring query */
#define REPLY          1     /* monitoring reply */

/* mserver INITIATE message types */
#define M_JOIN         10    /* mserver join_group */
#define M_LEAVE        11    /* mserver leave */
#define M_SPLIT        12    /* mserver split_group */
#define M_MERGE        13    /* mserver merge_group */
#define M_ADD_PARENT   14    /* mserver add parent */
#define M_DEL_PARENT   15    /* mserver delete parent */
#define M_STATS_S      16    /* mserver group stats - short */
#define M_STATS_L      17    /* mserver group stats - long */
#define M_FAIL         18    /* mserver fail */
#define M_MULTI_FAIL   19    /* multiple mservers fail */
#define M_COORD_FAIL   20    /* coordinator fail */

/* change sequence message types */
#define ACK             21
#define COMMIT          22
#define WAIT            23
#define MSG_QUERY       24
#define INIT            25

/* external physical requests to core-set */
#define M_JOIN_REQ      30      /* mserver join_group request */
#define M_LEAVE_REQ     31      /* mserver leave request */
#define M_SPLIT_REQ     32      /* mserver split_group request */
#define M_MERGE_REQ     33      /* mserver merge_group request */
#define M_ADD_PAR_REQ   34      /* mserver add parent request */
#define M_DEL_PAR_REQ   35      /* mserver delete parent request */
#define M_STATS_S_REQ   36      /* mserver group stats - short */
#define M_STATS_L_REQ   37      /* mserver group stats - long */

/* application group INITIATE message types */
#define A_JOIN          70      /* app join_group */
#define A_LEAVE         71      /* app leave_group */
#define A_SPLIT         72      /* app split_group */
#define A_MERGE         73      /* app merge_group */
#define A_STATS_S       74      /* app group stats - short */
#define A_STATS_L       75      /* app group stats - long */
#define SUBMIT          76      /* app change submission */
#define DIRECT          77      /* parent's change directive */

/* application group request message types */
#define A_JOIN_REQ      80      /* app join_group request */
#define A_LEAVE_REQ     81      /* app leave_group request */
#define A_SPLIT_REQ     82      /* app split_group request */
#define A_MERGE_REQ     83      /* app merge_group request */
#define A_STATS_S_REQ   84      /* app group stats - short */
#define A_STATS_L_REQ   85      /* app group stats - long */

struct table_entry {            /* member's entry in set table */
    u_long  addr;               /* IP address of member */
    u_short rank;               /* member's rank */
    u_short cw;                 /* gid of clockwise member (to "left") */
    u_short ccw;                /* gid of counterclockwise member */
    u_char  flag1;              /* status flag for each member */
    u_char  flag2;              /* status flag for each member */
};

struct gid_entry {              /* used for gid lists */
    u_short gid;                /* member's group ID */
    u_short rank;               /* member's rank */
    u_long  addr;               /* IP address of member */
    struct gid_entry *next;
};

struct message {                /* to build and receive messages */
    u_short vers;
    int     checksum;
    char    group_name[MAXGROUPNAME];
    u_short group_view;
    long    authentication;
    u_short sender_gid;
    u_short msg_type;
    u_short subject_gid;
    u_long  subject_addr;
    u_short subject_rank;
    struct gid_entry *exclude_list;
    u_short excl_list_len;
    struct gid_entry *subject_list;
    u_short subj_list_len;
    char    *data;
    int     data_len;
} messg;

struct group_state {            /* core and child set internal state */
    char    group_name[MAXGROUPNAME];
    struct sockaddr_in groupaddr;
    u_short group_size;
    u_short group_view;
    long    authentication;
    u_short mygid;
    u_short cw;
    u_short ccw;
    struct table_entry table;
    struct gid_entry *exclude_list;
    u_short excl_list_len;
    struct gid_entry *subject_list;
    u_short subj_list_len;
    char    *data;
    int     data_len;
    struct change_data *current_change;
    struct change_data *previous_change;
    struct gid_entry *failures;
    struct gid_entry *last_failure;
    struct messg_entry *requests;
    struct messg_entry *last_request;
    struct messg_entry *app_requests;
    struct messg_entry *last_app_request;
    struct timeval recv_timeout;
    struct timeval query_timeout;
    struct timeval reply_timeout;
    struct timeval messg_timeout;
    struct timeval ACK_timeout;
    u_short r_retries;
    u_short m_retries;
    u_short a_retries;
    u_short ACKcount;
    u_short change_underway;
    u_short elect_coordinator;
};

struct change_data {            /* current and previous change info */
    u_short coordinator;
    u_short subj_gid;
    u_long  subject_addr;
    u_short subj_rank;
    u_short group_name[MAXGROUPNAME];
    u_short type;
};

struct messg_entry {            /* entry in requests lists */
    char    group_name[MAXGROUPNAME];
    u_short group_view;
    long    authentication;
    u_short sender_gid;
    u_short msg_type;
    u_short subject_gid;
    u_long  subject_addr;
    u_short subject_rank;
    struct gid_entry *exclude_list;
    u_short excl_list_len;
    struct gid_entry *subject_list;
    u_short subj_list_len;
    char    *data;
    int     data_len;
    struct messg_entry *next;
};
/**********************************************************************
 * MSUTIL.C ver 1.0
 *
 * Utility procedures used by Membership Service programs.
 *
 *   int  init_socket(sin, port)
 *   int  join_mcast_group()
 *   void leave_mcast_group()
 *   int  addrcmp(addr1, addr2)
 *   void form_messg(messg, group, authentication, groupview, sender, type, subject,
 *                   excl_list, excl_list_len, subj_list, subj_list_len, data, data_len)
 *   int  send_messg(socket, message, dest)
 *   int  recv_messg(ms_socket, mc_socket, message, sender, timeout)
 *   void set_timeout()
 *   int  timed_out()
 *   int  search_gid_list(gid_list, gid)
 *   int  add_gid_entry(&gid_list, gid)
 *   int  copy_gid_list(gid_list, &buffer)
 *   int  extract_gid_list(buffer, &gid_list, list_len)
 *   void delete_gid_list(&gid_list)
 *   void print_in_addr(in_addr)
 *   void print_sock_addr(sin)
 *   void print_sock_info(s, sin)
 *   void print_hostent(hp)
 *   void print_messg(messg)
 *   void print_gid_list(gid_list)
 *
 * Written by Dave Neely, March 1994.
 * Modified: 4/26/94
 **********************************************************************/


#include  "msutil.h" 
#include  "mcaster.h" 
int  init_socket();
int  join_mcast_group();
void leave_mcast_group();
int  addrcmp();
void form_messg();
int  send_messg();
int  recv_messg();
void set_timeout();
int  timed_out();
int  search_gid_list();
int  add_gid_entry();
int  copy_gid_list();
int  extract_gid_list();
void delete_gid_list();
void print_in_addr();
void print_sock_addr();
void print_sock_info();
void print_hostent();
void print_messg();
void print_gid_list();

/* global variables */
struct sockaddr_in sin, mcsin;  /* socket addresses */
struct sockaddr_in group_addr;  /* group mcast address */
int ms, mc;                     /* IP socket fd's */

#ifdef MULTICAST
struct ip_mreq imr;             /* IGMP control */
#endif

int debug = 0;                  /* 1 = enable diagnostic prints */


/**********************************************************************
 * Initialize socket address structure
 **********************************************************************/
int init_socket(sin, port)
struct sockaddr_in sin;         /* socket address */
u_short port;
{
    int s;                      /* socket fd */
    int one = 1;

    bzero((char *)&sin, sizeof(sin));   /* clear address structure and initialize */
    sin.sin_family = AF_INET;
    sin.sin_port = htons(port);
    sin.sin_addr.s_addr = htonl(INADDR_ANY);

    /* open and bind UDP/IP socket */
    if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
        perror("can't open socket");
        exit(-1);
    }
    if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0) {
        perror("can't make socket reuseable");
        exit(-1);
    }
    if (bind(s, (struct sockaddr *) &sin, sizeof(sin)) < 0) {
        perror("can't bind socket");
        close(s);
        exit(-1);
    }
    return s;
}
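
For comparison, the same open, set-option, and bind sequence can be sketched with ANSI prototypes. The function name open_udp_socket is hypothetical. Two things are assumptions of this sketch rather than behaviour of the listing: the address structure is passed by pointer, so the caller sees the filled-in address, and failures are reported with a -1 return instead of exit().

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical ANSI-style variant of init_socket(): returns the bound
 * UDP socket fd, or -1 on failure (the original calls exit()). */
int open_udp_socket(struct sockaddr_in *sin, unsigned short port)
{
    int s, one = 1;

    memset(sin, 0, sizeof(*sin));           /* clear, then initialize */
    sin->sin_family = AF_INET;
    sin->sin_port = htons(port);
    sin->sin_addr.s_addr = htonl(INADDR_ANY);

    if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
        return -1;
    if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0 ||
        bind(s, (struct sockaddr *)sin, sizeof(*sin)) < 0) {
        close(s);                           /* do not leak the descriptor */
        return -1;
    }
    return s;
}
```

Passing port 0 lets the kernel choose a free port, which is convenient for experiments; the Membership Service itself uses MS_PORT.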

/**********************************************************************
 * Join IP multicast group or mcaster group (if unicast only).
 **********************************************************************/
int join_mcast_group(group_name, group_str_addr)
char *group_name;               /* group string name */
char *group_str_addr;           /* IP dot address */
{
    u_char loop = 0;            /* turn loopback option off */
    int len, sent;
    struct sockaddr_in from;    /* to receive sender's address */
    struct timeval timeout;     /* time to wait for response */

    timeout.tv_sec = 30;        /* wait max of 30 seconds (tv_sec is in seconds) */
    timeout.tv_usec = 0;

    /* set up group address structure */
    group_addr.sin_family = AF_INET;
    group_addr.sin_port = htons(MS_PORT);
    group_addr.sin_addr.s_addr = inet_addr(group_str_addr);
    printf("Group Address:\n");
    print_sock_addr(group_addr);

#ifdef MULTICAST                /* join IGMP multicast group */
    imr.imr_multiaddr.s_addr = inet_addr(group_str_addr);
    printf("group address: %s\n", inet_ntoa(imr.imr_multiaddr));
    imr.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(ms, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                   &imr, sizeof(imr)) < 0) {
        perror("can't join group");
        return NEG_JOIN;
    }
    if (setsockopt(ms, IPPROTO_IP, IP_MULTICAST_LOOP,
                   &loop, sizeof(loop)) < 0) {
        perror("can't disable multicast loopback");
    }
    printf("group %s joined\n", inet_ntoa(imr.imr_multiaddr));
    return JOIN_ACK;
#else                           /* join MCASTER multicast emulator group */
    /* generate and send join request message to MCASTER */
    form_messg(&messg, group_name, 0, 0, 0, JOIN_GROUP, 0, 0, 0, 0, 0, 0, 0);
    len = sizeof(messg);

    printf("SENDING JOIN MESSAGE\n");
    printf("message to send:\n");
    print_messg(messg);

    sent = send_messg(ms, messg, group_addr);
    printf("%d bytes sent\n", sent);

    /* wait for ACK message from MCASTER */
    if ((sent = recv_messg(ms, mc, &messg, &from, timeout)) < 0)
        printf("error in message rec'd\n");
    else {
        printf("MESSAGE RECEIVED:\n");
        printf("%d bytes received\n", sent);
        print_messg(messg);
        printf("sender:\n");
        print_sock_addr(from);
    }
#endif
    return messg.msg_type;
}

/**********************************************************************
 * Leave IP multicast or mcaster group.
 **********************************************************************/
void leave_mcast_group(group_name)
char *group_name;
{
    int len, sent;
    short message_type;
    struct sockaddr_in from;    /* to receive sender's address */
    struct timeval timeout;

    /* wait 30 seconds; recv_messg() passes this to select(), which
       expects a relative interval rather than an absolute time */
    timeout.tv_sec = 30;
    timeout.tv_usec = 0;

    /* leave group */
#ifdef MULTICAST
    if (setsockopt(ms, IPPROTO_IP, IP_DROP_MEMBERSHIP,
                   &imr, sizeof(struct ip_mreq)) < 0) {
        perror("can't leave group");
        exit(-1);
    }
    else
        printf("group %s left\n", group_name);
#else
    /* generate and send leave request message to MCASTER */
    form_messg(&messg, group_name, 0, 0, 0, LEAVE_GROUP, 0, 0, 0, 0, 0, 0, 0);
    len = sizeof(messg);

    printf("SENDING LEAVE MESSAGE\n");
    printf("message to send:\n");
    print_messg(messg);

    sent = send_messg(ms, messg, group_addr);
    printf("%d bytes sent\n", sent);

    /* wait for ACK message from MCASTER */
    if ((sent = recv_messg(ms, mc, &messg, &from, timeout)) < 0)
        printf("error in message rec'd\n");
    else {
        printf("MESSAGE RECEIVED:\n");
        printf("%d bytes received\n", sent);
        print_messg(messg);
        printf("sender:\n");
        print_sock_addr(from);
        message_type = ntohs(messg.msg_type);
        printf("message_type = %d\n", message_type);
        if ((strcmp(messg.group_name, group_name)) ||
            (!(message_type == LEAVE_ACK)))
            printf("unable to leave group: error %d\n", message_type);
    }
#endif
}

/**********************************************************************
 * Compare two sockaddr_in structs; return 1 if same, 0 otherwise
 **********************************************************************/
int addrcmp(addr1, addr2)
struct sockaddr_in addr1;
struct sockaddr_in addr2;
{
    return ((addr1.sin_family == addr2.sin_family) &&
            (addr1.sin_port == addr2.sin_port) &&
            (addr1.sin_addr.s_addr == addr2.sin_addr.s_addr));
}   /* addrcmp */
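
The comparison is done field by field rather than over the whole structure. A sketch of the same test with an ANSI prototype and pointer arguments follows; the name addr_equal is hypothetical.

```c
#include <netinet/in.h>

/* Hypothetical pointer-based variant of addrcmp(): returns 1 if family,
 * port and IP address all match, 0 otherwise. */
int addr_equal(const struct sockaddr_in *a, const struct sockaddr_in *b)
{
    return a->sin_family == b->sin_family &&
           a->sin_port == b->sin_port &&
           a->sin_addr.s_addr == b->sin_addr.s_addr;
}
```

Comparing the three fields individually avoids false mismatches from the sin_zero padding, which a byte-wise comparison of the whole structure would also inspect.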


/**********************************************************************
 * Compose message.  Copy all integer values, point list and data
 * pointers to appropriate list or data string.
 **********************************************************************/
void form_messg(messg, group, auth, GV, sender, type, subject, excl_list,
                excl_list_len, subj_list, subj_list_len, data, data_len)
struct message *messg;
char *group;                    /* group string name */
long auth;                      /* authentication number */
u_short GV;                     /* group view number */
u_short sender;                 /* sender gid */
u_short type;                   /* message type */
u_short subject;                /* subject gid */
struct gid_entry *excl_list;    /* gid exclude list */
u_short excl_list_len;
struct gid_entry *subj_list;    /* gid subject list */
u_short subj_list_len;
char *data;
u_short data_len;
{
    /* bzero((char *)messg, sizeof(*messg)); */
    messg->vers = htons(VERS);              /* print_messg() reads this with ntohs() */
    messg->checksum = htonl(0xffff);        /* int field, read with ntohl() */
    strcpy(messg->group_name, group);
    messg->authentication = htonl(auth);    /* long field, read with ntohl() */
    messg->group_view = htons(GV);
    messg->sender_gid = htons(sender);
    messg->msg_type = htons(type);
    messg->subject_gid = htons(subject);
    messg->exclude_list = excl_list;
    messg->excl_list_len = htons(excl_list_len);
    messg->subject_list = subj_list;
    messg->subj_list_len = htons(subj_list_len);
    messg->data = data;
    messg->data_len = htons(data_len);
}

/**********************************************************************
 * Send a variable length message "messg".  The message may contain 2
 * lists of gids, and data field.  These are appended to the buffer,
 * "messgbuf", used to store the overall message.  Returns the number
 * of bytes sent.
 **********************************************************************/
int send_messg(s, messg, to)
int s;                          /* socket fd */
struct message messg;
struct sockaddr_in to;
{
    int sent;
    int msglen = sizeof(messg) + (ntohs(messg.excl_list_len) +
                 ntohs(messg.subj_list_len))*2 + ntohs(messg.data_len);
    char *messgbuf, *mp;        /* message buffer and pointer */
    char *datastr;              /* for diagnostic prints */
    int i;
    u_short val, *up;           /* for diagnostic prints */

    if (debug) printf("messglen to send: %d\n", msglen);
    if (!(messgbuf = (char *) malloc(msglen))) {
        perror("unable to create message buffer");
        return -1;
    }

    /* copy messg into outgoing buffer */
    bzero(messgbuf, msglen);                /* clear buffer */
    bcopy((char *)&messg, messgbuf, sizeof(messg)); /* copy fixed part only */
    mp = messgbuf + sizeof(messg);          /* skip over messg */

    /* append excl & subj lists and data */
    copy_gid_list(messg.exclude_list, &mp);
    if (debug) {                /* print excl_list string */
        printf("excl_list to send: ");
        for (i = 0; i < ntohs(messg.excl_list_len); i++) {
            up = (u_short *)(mp + i*2);
            printf(" %d ", *up);
        }
        printf("\n");
    }
    mp += (ntohs(messg.excl_list_len)*2);   /* skip 2 * number of gids */
    copy_gid_list(messg.subject_list, &mp);
    if (debug) {                /* print subj_list string */
        printf("subj_list to send: ");
        for (i = 0; i < ntohs(messg.subj_list_len); i++) {
            up = (u_short *)(mp + i*2);
            printf(" %d ", *up);
        }
        printf("\n");
    }
    mp += (ntohs(messg.subj_list_len)*2);   /* skip 2 * number of gids */
    bcopy(messg.data, mp, ntohs(messg.data_len));

    if (debug) {    /* create temporary data string to print messg.data */
        if (!(datastr = (char *) malloc(ntohs(messg.data_len)+1)))
            perror("unable to create data string");
        else {
            bcopy(mp, datastr, ntohs(messg.data_len));
            datastr[ntohs(messg.data_len)] = '\0';  /* make string */
            printf("data to send: %s\n", datastr);
            free(datastr);
        }
    }

    if ((sent = sendto(s, messgbuf, msglen, 0, (struct sockaddr *)&to,
                       sizeof(struct sockaddr))) != msglen)
        perror("error in message length sent");
    free(messgbuf);
    return sent;
}   /* send_messg */
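
The wire layout built by send_messg() is a fixed header, followed by the exclude gids, the subject gids, and finally the data bytes, with each gid occupying 2 bytes. The sketch below packs that layout into a caller-supplied buffer. The function name pack_message and its flat-array arguments are hypothetical simplifications; the listing works on linked lists and a heap buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified packer: header bytes, then each gid as
 * 2 bytes, then the data bytes.  Returns the total length written,
 * or 0 if the buffer is too small. */
size_t pack_message(unsigned char *buf, size_t buflen,
                    const void *hdr, size_t hdr_len,
                    const uint16_t *excl, size_t n_excl,
                    const uint16_t *subj, size_t n_subj,
                    const void *data, size_t data_len)
{
    size_t total = hdr_len + 2 * (n_excl + n_subj) + data_len;
    unsigned char *mp = buf;                /* write pointer */

    if (total > buflen)
        return 0;
    memcpy(mp, hdr, hdr_len);      mp += hdr_len;       /* fixed header */
    memcpy(mp, excl, 2 * n_excl);  mp += 2 * n_excl;    /* exclude gids */
    memcpy(mp, subj, 2 * n_subj);  mp += 2 * n_subj;    /* subject gids */
    memcpy(mp, data, data_len);                         /* payload */
    return total;
}
```

Only hdr_len bytes are read from the header; the remaining bytes of the datagram come from the lists and the data field.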


/**********************************************************************
 * Receive a variable length message from either the ms or mc sockets.
 * Use select() to receive from ready socket into messgbuf.  If received
 * from ms, then messgbuf contains only "messg" and can be transferred.
 * If received from mc, then messgbuf has the sender's address at the
 * front which is extracted into "from", then extract "messg".
 * Note: recv_messg allocates memory for the received gid lists and
 * data.  Messg is returned with pointers pointing to these new lists
 * and data.  The lists and data must be deallocated when no longer
 * needed, and before a new message is formed.  Otherwise, the links to
 * the memory will be lost, and the memory cannot be recovered.
 **********************************************************************/
int recv_messg(ms, mc, messg, from, timeout)
int ms, mc;                     /* socket fd's */
struct message *messg;          /* to hold incoming message */
struct sockaddr_in *from;       /* extract sender's address */
struct timeval timeout;         /* for variable timeout */
{
    char messgbuf[MAXMSGLEN], *mp;  /* message buffer and pointer */
    int len = sizeof(*from);
    int ready, sent = 0;
    fd_set fdread;              /* fd mask for select() */
    char *datastr, *data;       /* to receive messg.data */

    /* initialize for reception from multiple sockets */
    FD_ZERO(&fdread);
    FD_SET(ms, &fdread);
    FD_SET(mc, &fdread);

    /* wait until either socket is ready to be read */
    if ((ready = select(32, &fdread, 0, 0, &timeout)) < 0) {
        perror("select error");
        return -1;
    }

    if (ready) {
        bzero((char *)messg, sizeof(*messg));
        if (FD_ISSET(ms, &fdread)) {    /* received from ms socket */
            printf("received at MS socket\n");
            if ((sent = recvfrom(ms, messgbuf, MAXMSGLEN, 0, from, &len)) < 0) {
                perror("error in message rec'd");
                return -1;
            }
            else    /* extract message from messgbuf */
                mp = messgbuf;  /* set ptr to beginning of message */
        }
        if (FD_ISSET(mc, &fdread)) {    /* received from mc socket */
            printf("received at MC socket\n");
            if ((sent = recvfrom(mc, messgbuf, MAXMSGLEN, 0, from, &len)) < 0) {
                perror("error in message rec'd");
                return -1;
            }
            else {  /* extract sender address & message from messgbuf */
                bzero((char *)from, len);
                bcopy(messgbuf, (char *)from, len);
                mp = messgbuf + len;    /* set ptr to beginning of message */
            }
        }

        /* extract messg, exclude & subject lists, and any data from messgbuf */
        bcopy(mp, (char *)messg, sizeof(*messg));
        mp += sizeof(*messg);           /* skip to lists */
        if ((len = extract_gid_list(mp, &(messg->exclude_list),
                ntohs(messg->excl_list_len))) != ntohs(messg->excl_list_len)) {
            printf("error in extracting exclude list: len = %d\n", len);
            return -1;
        }
        if (debug) printf("len = %d gids extracted\n", len);
        mp += (ntohs(messg->excl_list_len))*2;  /* skip to end of list */
        if (debug) printf("mp-messgbuf = %d\n", mp-messgbuf);
        if ((len = extract_gid_list(mp, &(messg->subject_list),
                ntohs(messg->subj_list_len))) != ntohs(messg->subj_list_len)) {
            printf("error in extracting subject list: len = %d\n", len);
            return -1;
        }
        if (debug) printf("len = %d gids extracted\n", len);
        mp += (ntohs(messg->subj_list_len))*2;  /* skip to end of list */
        if (debug) printf("mp-messgbuf = %d\n", mp-messgbuf);
        if (!(data = (char *) malloc(ntohs(messg->data_len)))) {
            perror("unable to allocate memory for data");
            return -1;
        }

        /* copy received data into messg.data */
        bcopy(mp, data, ntohs(messg->data_len));
        messg->data = data;
        if (debug) printf("after data: mp-messgbuf = %d\n", mp-messgbuf);

        if (debug) {    /* create temporary data string to print messg.data */
            printf("messg->data_len = %d\n", ntohs(messg->data_len));
            if (!(datastr = (char *) malloc(ntohs(messg->data_len)+1)))
                perror("unable to create data string");
            else {
                bcopy(mp, datastr, ntohs(messg->data_len));
                datastr[ntohs(messg->data_len)] = '\0'; /* make string */
                printf("data rec'd: %s\n", datastr);
                free(datastr);
            }
        }
    }   /* ready */
    return sent;
}   /* recv_messg */
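
The core of recv_messg() is a select() call that waits on two descriptors with a timeout. The sketch below isolates that step. The helper name wait_either and its return convention are hypothetical, and, unlike the listing's fixed descriptor count of 32, the first argument to select() is computed from the descriptors themselves.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wait up to `timeout` for either descriptor to
 * become readable.  Returns 1 if fd_a is ready (fd_a wins ties),
 * 2 if only fd_b is ready, 0 on timeout, -1 on error. */
int wait_either(int fd_a, int fd_b, struct timeval timeout)
{
    fd_set fdread;
    int maxfd = (fd_a > fd_b) ? fd_a : fd_b;
    int ready;

    FD_ZERO(&fdread);
    FD_SET(fd_a, &fdread);
    FD_SET(fd_b, &fdread);

    /* timeout is a relative interval, not an absolute time */
    ready = select(maxfd + 1, &fdread, NULL, NULL, &timeout);
    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;
    return FD_ISSET(fd_a, &fdread) ? 1 : 2;
}
```

Since the timeout is taken by value, the caller's copy is unchanged even on systems where select() updates the structure it is given.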


/**********************************************************************
 * Set timer to current time + timeout time t_usec.  Converts t_usec to
 * seconds and useconds, and adds to timer.tv_sec & timer.tv_usec,
 * respectively.  If useconds reach 1,000,000, a carry to seconds is
 * performed.
 **********************************************************************/
void set_timeout(timer, t_usec)
struct timeval *timer;          /* timer to set */
long t_usec;                    /* timeout period in usec. */
{
    struct timezone tzp;        /* for timing */

    if (t_usec == MAXTIME) {    /* set sec & usec to MAXTIME */
        timer->tv_sec = MAXTIME;
        timer->tv_usec = MAXTIME;
    }
    else {                      /* set timer to current time + t_usec */
        if (gettimeofday(timer, &tzp) != 0) {
            perror("unable to gettimeofday");
            exit(-1);
        }

        /* add t_usec to timer */
        timer->tv_sec += t_usec / SEC;  /* add seconds */
        timer->tv_usec += t_usec % SEC; /* add useconds */
        if (timer->tv_usec >= SEC) {    /* carry 1 sec. */
            timer->tv_usec -= SEC;
            timer->tv_sec += 1;
        }
    }
}   /* set_timeout */


/**********************************************************************
 * Check if timer has timed out.  Returns 1 if current time > timer,
 * 0 otherwise.
 **********************************************************************/
int timed_out(timer)
struct timeval timer;           /* timeout timeval */
{
    struct timeval tp;          /* for time stamps */
    struct timezone tzp;        /* for timing */

    if (gettimeofday(&tp, &tzp) != 0) {
        perror("unable to gettimeofday");
        exit(-1);
    }
    return ((tp.tv_sec > timer.tv_sec) ||
            ((tp.tv_sec == timer.tv_sec) && (tp.tv_usec > timer.tv_usec)));
}   /* timed_out */
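
The deadline arithmetic in set_timeout() and timed_out() can be exercised without reading the clock if the current time is passed in as a parameter. The helpers deadline_after and is_past below are hypothetical names for such a sketch.

```c
#include <sys/time.h>

#define SEC 1000000L    /* microseconds per second, as in msutil.h */

/* Hypothetical pure version of set_timeout(): returns now + t_usec,
 * carrying into tv_sec when tv_usec reaches one second. */
struct timeval deadline_after(struct timeval now, long t_usec)
{
    now.tv_sec  += t_usec / SEC;
    now.tv_usec += t_usec % SEC;
    if (now.tv_usec >= SEC) {   /* carry 1 sec. */
        now.tv_usec -= SEC;
        now.tv_sec  += 1;
    }
    return now;
}

/* Hypothetical pure version of timed_out(): 1 if now is strictly later
 * than deadline, 0 otherwise. */
int is_past(struct timeval now, struct timeval deadline)
{
    return now.tv_sec > deadline.tv_sec ||
           (now.tv_sec == deadline.tv_sec &&
            now.tv_usec > deadline.tv_usec);
}
```

For example, starting at 10.9 s and adding 1.2 s first gives 11 s plus 1,100,000 usec, which the carry normalizes to 12 s plus 100,000 usec.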


/**********************************************************************
 * Search a list of gid_entries pointed at by gid_list for "gid".  Return
 * 1 if the gid is found, 0 otherwise.
 **********************************************************************/
int search_gid_list(gid_list, gid)
struct gid_entry *gid_list;
u_short gid;
{
    struct gid_entry *gp = gid_list;
    int found = 0;

    while (gp && !found) {
        found = (gid == gp->gid);
        gp = gp->next;
    }
    return found;
}   /* search_gid_list */
/**********************************************************************
 * Add a new node to the head of the list of gid_entries pointed at by
 * gid_list.  Return 1 if successful, 0 if unable to add to list.
 **********************************************************************/
int add_gid_entry(gid_list, gid)
struct gid_entry **gid_list;    /* pointer to gid_list pointer */
u_short gid;
{
    struct gid_entry *gp;

    if (search_gid_list(*gid_list, gid))    /* duplicate gid found in list */
        return 0;

    /* allocate new gid_entry */
    if (!(gp = (struct gid_entry *) malloc(sizeof(struct gid_entry)))) {
        perror("unable to create new gid_entry");
        return 0;
    }

    /* add new entry to head of gid_list */
    gp->gid = gid;
    if (!(*gid_list))           /* if empty gid_list */
        gp->next = NULL;
    else                        /* nonempty list.. insert at head */
        gp->next = *gid_list;
    *gid_list = gp;
    return 1;
}   /* add_gid_entry */

/**********************************************************************
 * Copy the gids from a list of gid_entries pointed at by gid_list into
 * a buffer of characters.  Since each gid is u_short, it will take 2
 * bytes.  Uses pointer math to increment through buffer to place gids.
 * Returns the number of gids copied or 0 for an error.
 **********************************************************************/
int copy_gid_list(gid_list, buffer)
struct gid_entry *gid_list;
char **buffer;                  /* pointer to buffer */
{
    struct gid_entry *gp = gid_list;
    char *cp = *buffer;

    if (debug) printf("copy_gid_list: cp-(*buffer) = %d\n", cp-(*buffer));
    while (gp) {                /* copy gids one at a time */
        if (debug) printf("gp->gid = %d\n", gp->gid);
        bcopy((char *)&(gp->gid), cp, 2);
        cp += 2;                /* u_short = 2 bytes */
        gp = gp->next;
    }
    if (debug) printf("(cp-(*buffer))/2 = %d\n", (cp-(*buffer))/2);
    return (cp - (*buffer)) / 2;    /* number of gids copied */
}   /* copy_gid_list */


/**********************************************************************
 * Extract gids from a buffer of characters into a list of gid_entries
 * pointed to by gid_list.  Each gid is 2 bytes in the buffer.  Uses
 * pointer math to increment through buffer to place gids.
 * Returns the number of gids extracted or 0 for an error.
 **********************************************************************/
int extract_gid_list(buffer, gid_list, list_len)
char *buffer;
struct gid_entry **gid_list;    /* pointer to gid_list pointer */
u_short list_len;
{
    u_short i = 0;              /* count of gids */
    u_short gid;

    *gid_list = NULL;
    while (i < list_len) {      /* extract gids one at a time */
        bcopy((buffer + (i*2)), (char *)&gid, 2);
        if (!(add_gid_entry(gid_list, gid)))
            return 0;           /* unsuccessful add */
        i++;
    }
    return i;                   /* number of gids extracted */
}   /* extract_gid_list */
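
Because new entries are inserted at the head of the list, a list that is copied into a buffer and then extracted again comes back in reverse order. The sketch below reproduces that round trip with a simplified node type; gid_node, gid_push, gid_pack, gid_unpack and gid_free are hypothetical names, and the duplicate check performed by add_gid_entry() is left out.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct gid_node {               /* simplified stand-in for gid_entry */
    uint16_t gid;
    struct gid_node *next;
};

/* Insert at the head of the list; returns 1 on success, 0 on failure. */
int gid_push(struct gid_node **list, uint16_t gid)
{
    struct gid_node *n = malloc(sizeof(*n));

    if (!n)
        return 0;
    n->gid = gid;
    n->next = *list;
    *list = n;
    return 1;
}

/* Write each gid as 2 bytes into buf; returns the number written. */
size_t gid_pack(const struct gid_node *list, unsigned char *buf)
{
    size_t n = 0;

    for (; list; list = list->next, n++)
        memcpy(buf + 2 * n, &list->gid, 2);
    return n;
}

/* Rebuild a list from `count` 2-byte gids; returns the number read. */
size_t gid_unpack(const unsigned char *buf, size_t count,
                  struct gid_node **list)
{
    size_t i;
    uint16_t gid;

    *list = NULL;
    for (i = 0; i < count; i++) {
        memcpy(&gid, buf + 2 * i, 2);
        if (!gid_push(list, gid))
            break;
    }
    return i;
}

/* Free every node and reset the list pointer. */
void gid_free(struct gid_node **list)
{
    while (*list) {
        struct gid_node *next = (*list)->next;
        free(*list);
        *list = next;
    }
}
```

Code that depends on list order after reception would need to account for this reversal; membership tests such as search_gid_list() are unaffected.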

/**********************************************************************
 * Remove all gids from a list of gid_entries pointed at by gid_list and
 * free all memory.  Uses two pointers, ngp and cgp, to walk through list
 * and free each entry.
 **********************************************************************/
void delete_gid_list(gid_list)
struct gid_entry **gid_list;
{
    struct gid_entry *ngp, *cgp = *gid_list;

    while (cgp) {               /* current gid ptr != NULL */
        ngp = cgp->next;        /* get next entry */
        free(cgp);              /* free current entry */
        cgp = ngp;
    }
    *gid_list = NULL;
}   /* delete_gid_list */

/**********************************************************************
 * Print message fields.
 **********************************************************************/
void print_messg(messg)
struct message messg;
{
    char *datastr;              /* to convert data to a string */

    printf("version: %d\n", ntohs(messg.vers));
    printf("checksum: %d\n", ntohl(messg.checksum));
    printf("group_name: %s\n", messg.group_name);
    printf("authentication: %d\n", ntohl(messg.authentication));
    printf("group_view: %d\n", ntohs(messg.group_view));
    printf("sender_gid: %d\n", ntohs(messg.sender_gid));
    printf("subject_gid: %d\n", ntohs(messg.subject_gid));
    printf("subject_addr: %d\n", ntohl(messg.subject_addr));
    printf("subject_rank: %d\n", ntohs(messg.subject_rank));
    printf("msg_type: ");
    switch (ntohs(messg.msg_type)) {
    /* monitoring message types */
    case QUERY:         printf(" QUERY\n");         break;
    case REPLY:         printf(" REPLY\n");         break;
    /* mserver INITIATE message types */
    case M_JOIN:        printf(" M_JOIN\n");        break;
    case M_LEAVE:       printf(" M_LEAVE\n");       break;
    case M_SPLIT:       printf(" M_SPLIT\n");       break;
    case M_MERGE:       printf(" M_MERGE\n");       break;
    case M_ADD_PARENT:  printf(" M_ADD_PARENT\n");  break;
    case M_DEL_PARENT:  printf(" M_DEL_PARENT\n");  break;
    case M_STATS_S:     printf(" M_STATS_S\n");     break;
    case M_STATS_L:     printf(" M_STATS_L\n");     break;
    case M_FAIL:        printf(" M_FAIL\n");        break;
    case M_MULTI_FAIL:  printf(" M_MULTI_FAIL\n");  break;
    case M_COORD_FAIL:  printf(" M_COORD_FAIL\n");  break;
    /* change sequence message types */
    case ACK:           printf(" ACK\n");           break;
    case COMMIT:        printf(" COMMIT\n");        break;
    case WAIT:          printf(" WAIT\n");          break;
    case MSG_QUERY:     printf(" MSG_QUERY\n");     break;
    case INIT:          printf(" INIT\n");          break;
    /* external physical requests to core-set */
    case M_JOIN_REQ:    printf(" M_JOIN_REQ\n");    break;
    case M_LEAVE_REQ:   printf(" M_LEAVE_REQ\n");   break;
    case M_SPLIT_REQ:   printf(" M_SPLIT_REQ\n");   break;
    case M_MERGE_REQ:   printf(" M_MERGE_REQ\n");   break;
    case M_ADD_PAR_REQ: printf(" M_ADD_PAR_REQ\n"); break;
    case M_DEL_PAR_REQ: printf(" M_DEL_PAR_REQ\n"); break;
    case M_STATS_S_REQ: printf(" M_STATS_S_REQ\n"); break;
    case M_STATS_L_REQ: printf(" M_STATS_L_REQ\n"); break;
    /* application group INITIATE message types */
    case A_JOIN:        printf(" A_JOIN\n");        break;
    case A_LEAVE:       printf(" A_LEAVE\n");       break;
    case A_SPLIT:       printf(" A_SPLIT\n");       break;
    case A_MERGE:       printf(" A_MERGE\n");       break;
    case A_STATS_S:     printf(" A_STATS_S\n");     break;
    case A_STATS_L:     printf(" A_STATS_L\n");     break;
    case SUBMIT:        printf(" SUBMIT\n");        break;
    case DIRECT:        printf(" DIRECT\n");        break;
    /* application group request message types */
    case A_JOIN_REQ:    printf(" A_JOIN_REQ\n");    break;
    case A_LEAVE_REQ:   printf(" A_LEAVE_REQ\n");   break;
    case A_SPLIT_REQ:   printf(" A_SPLIT_REQ\n");   break;
    case A_MERGE_REQ:   printf(" A_MERGE_REQ\n");   break;
    case A_STATS_S_REQ: printf(" A_STATS_S_REQ\n"); break;
    case A_STATS_L_REQ: printf(" A_STATS_L_REQ\n"); break;
    /* mcaster message types */
    case JOIN_GROUP:    printf(" JOIN_GROUP\n");    break;
    case LEAVE_GROUP:   printf(" LEAVE_GROUP\n");   break;
    case JOIN_ACK:      printf(" JOIN_ACK\n");      break;
    case DUP_MEMBER:    printf(" DUP_MEMBER\n");    break;
    case NEG_JOIN:      printf(" NEG_JOIN\n");      break;
    case LEAVE_ACK:     printf(" LEAVE_ACK\n");     break;
    case NO_GROUP:      printf(" NO_GROUP\n");      break;
    case NO_MEMBER:     printf(" NO_MEMBER\n");     break;
    case NEG_LEAVE:     printf(" NEG_LEAVE\n");     break;
    default:            printf(" %d\n", ntohs(messg.msg_type));
    }
    printf("exclude_list: ");
    print_gid_list(messg.exclude_list);
    printf("excl_list_len: %d\n", ntohs(messg.excl_list_len));
    printf("subject_list: ");
    print_gid_list(messg.subject_list);
    printf("subj_list_len: %d\n", ntohs(messg.subj_list_len));
    printf("data_len: %d\n", ntohs(messg.data_len));

    /* create temporary data string to print messg.data */
    if (!(datastr = (char *) malloc(ntohs(messg.data_len)+1)))
        perror("unable to create data string");
    else {
        bcopy(messg.data, datastr, ntohs(messg.data_len));
        datastr[ntohs(messg.data_len)] = '\0';  /* make string */
        printf("data: %s\n", datastr);
        free(datastr);
    }
}   /* print_messg */


/**********************************************************************
 * Print in_addr IP address.
 **********************************************************************/
void print_in_addr(addr)
struct in_addr *addr;
{
    char *ip_addr = (char *)inet_ntoa(*addr);
    printf("IP address = %s\n", ip_addr);
}   /* print_in_addr */


/**********************************************************************
 * Print sockaddr_in address structure info.
 **********************************************************************/
void print_sock_addr(sin)
struct sockaddr_in sin;         /* socket address structure */
{
    printf("family: %d \n", sin.sin_family);    /* stored in host byte order */
    printf("port: %d \n", ntohs(sin.sin_port));
    print_in_addr(&sin.sin_addr);
}   /* print_sock_addr */

/**********************************************************************
 * Print socket info.
 **********************************************************************/
void print_sock_info(s, sin)
int s;                   /* socket fd */
struct sockaddr_in sin;  /* socket address structure */
{
    int len = sizeof(sin);

    if (getsockname(s, (struct sockaddr *) &sin, &len) < 0) {
        perror("can't get socket info");
        exit(1);
    }
    printf("Socket Info:  \n");
    printf("socket:  %d \n", s);
    print_sock_addr(sin);
}  /* print_sock_info */

/**********************************************************************
 * Print hostent structure info.
 **********************************************************************/
void print_hostent(hp)
struct hostent *hp;
{
    char *af = hp->h_addrtype == AF_INET ? "AF_INET" : "non-AF_INET";

    printf("Hostent Info:  \n");
    printf("h_name:  %s\n", hp->h_name);
    printf("h_aliases[0]:  %s\n", hp->h_aliases[0]);
    printf("h_addrtype:  %s\n", af);
    printf("h_length:  %d\n", hp->h_length);  /* host byte order */
    printf("h_addr:  %s\n", inet_ntoa(*(struct in_addr *)(hp->h_addr)));
    printf("h_addr_list[0]:  %s\n",
           inet_ntoa(*(struct in_addr *)(hp->h_addr_list[0])));
}  /* print_hostent */

/**********************************************************************
 * Print all gids from a list of gid entries pointed at by gid_list.
 **********************************************************************/
void print_gid_list(gid_list)
struct gid_entry *gid_list;
{
    struct gid_entry *gp = gid_list;

    if (!gp)
        printf("  empty list");
    else {
        while (gp != NULL) {
            printf(" %d ", gp->gid);
            gp = gp->next;  /* get next entry */
        }
    }
    printf("\n");
}  /* print_gid_list */
/**********************************************************************
 * MCASTER.C ver 1.0  Multicast Emulator
 *
 * Program to emulate IP multicast in a unicast environment.
 * Uses single socket for send & receive, with the IP address & port
 * the same as would be used for an IP multicast (port = MS_PORT).
 * Incoming messages are of "message" format, outgoing unicast messages
 * are also of "message" format (for join & leave ACKs to members).
 * Outgoing multicast messages have the original sender's sockaddr_in
 * prepended to the message, since mcaster overwrites the original
 * sender's address with its own and the recipients have no other way
 * of knowing who was the original sender.
 *
 * Note: this version has no error checking or diagnostic print
 * statements ... any erroneous message is simply discarded or
 * delivered as is.  For diagnostics, use MCASTERV.C, the same program
 * with diagnostic print statements.
 *
 * Written by Dave Neely, March 1994.
 * Modified 4/26/94
 **********************************************************************/

#include "msutil.c"

struct member {  /* element in list of members */
    struct sockaddr_in addr;
    u_char loop;
    struct member *next;
};

struct group {  /* element in list of groups */
    char name[MAXGROUPNAME];
    struct group *next;
    struct member *members;
    struct member *last;
};

struct sockaddr_in sin;         /* socket address */
struct sockaddr_in group_addr;  /* group mcast address */
struct sockaddr_in from;        /* received from address */
struct sockaddr_in member;      /* member address */
struct hostent *hp;             /* host entity struct */
struct group *group_list, *last_group;  /* global group list ptrs */

/* functions */
struct group *search_group_list();
struct member *search_member_list();
struct group *add_group();
int add_member();
int join_group();
int remove_group();
int remove_member();
int leave_group();
int mcast();
void print_group_list();
void print_member_list();


main()
{
    int    s;                       /* IP socket fd */
    ushort port;
    int    len;
    int    sent;
    char   hostname[MAXGROUPNAME];
    char   hostaddr[17];
    char   msgbuf[MAXMSGLEN];       /* to recv message */
    char   *msgstr;                 /* to copy message */
    short  message_type, msglen;
    short  cc;

    /* initialize socket */
    port = htons(MS_PORT);
    s = init_socket(sin, port);     /* mcaster socket */
    print_sock_info(s, sin);

    /* get info about local host */
    gethostname(hostname, MAXGROUPNAME);
    if ((hp = gethostbyname(hostname)) == 0) {
        perror("unable to get hostname");
        exit(-1);
    }
    print_hostent(hp);
    strcpy(hostaddr, inet_ntoa(*(struct in_addr *)(hp->h_addr)));

    /* initialize group address structure */
    bzero((char *)&group_addr, sizeof(group_addr));
    group_addr.sin_family = hp->h_addrtype;
    group_addr.sin_port = port;  /* port is already in network byte order */
    group_addr.sin_addr.s_addr = inet_addr(hostaddr);
    printf("Group Address\n");
    print_sock_addr(group_addr);

    for (;;) {  /* wait for incoming multicast messages */
        len = sizeof(from);
        sent = recvfrom(s, msgbuf, MAXMSGLEN, 0,
                        (struct sockaddr *)&from, &len);

        /* extract messg from buffer */
        bzero((char *)&messg, sizeof(messg));
        bcopy(msgbuf, (char *)&messg, sizeof(messg));

        /* check type of received message */
        message_type = ntohs(messg.msg_type);
        if ((message_type == JOIN_GROUP) || (message_type == LEAVE_GROUP)) {
            member = from;
            /* all members receive mcasts on the MC_PORT */
            member.sin_port = MC_PORT;
            if (message_type == JOIN_GROUP)
                cc = join_group(messg.group_name, member, NO_LOOP);
            else
                cc = leave_group(messg.group_name, member);

            /* generate and send ACK for join or leave */
            form_messg(&messg, messg.group_name, 0, messg.group_view,
                       0, cc, messg.sender_gid, 0, 0, 0, 0, 0, 0);
            len = sizeof(messg);
            sendto(s, (char *)&messg, len, 0,
                   (struct sockaddr *)&from, sizeof(struct sockaddr));
        }
        else  /* multicast unchanged message to group */
            mcast(s, msgbuf, from);
    }
}  /* main */

/**********************************************************************
 * Search group list for a group by its string name.  Return a pointer
 * to the group before the desired group, for ease of removing the
 * group, or NULL if not found.
 **********************************************************************/
struct group *search_group_list(groupname)
char *groupname;
{
    struct group *gp = group_list;
    int notfound;

    if (group_list) {  /* non-empty group list */
        if (!(notfound = strcmp(gp->name, groupname)))  /* found 1st one */
            gp = last_group;  /* set gp to element before 1st element */
        else  /* not the 1st element - search for a match */
            while ((notfound = strcmp(gp->next->name, groupname)) &&
                   (gp->next != last_group))
                gp = gp->next;
        if (!notfound) {  /* found! */
            return gp;
        }
    }  /* else group not found or empty group list */
    return NULL;
}  /* search_group_list */


/**********************************************************************
 * Search member list of a group pointed to by gp for member "mbr".
 * Return a pointer to the member before the desired member, for
 * ease of removing the member, or NULL if not found.
 **********************************************************************/
struct member *search_member_list(gp, mbr)
struct group *gp;        /* points to the desired group */
struct sockaddr_in mbr;  /* member address to locate */
{
    struct member *mp = gp->members;
    int found;

    if (gp->members) {  /* non-empty member list */
        if ((found = addrcmp(mp->addr, mbr)) != 0)  /* found 1st one */
            mp = gp->last;  /* set mp to element before 1st element */
        else  /* not the 1st element - search for a match */
            while ((!(found = addrcmp(mp->next->addr, mbr))) &&
                   (mp->next != gp->last))
                mp = mp->next;
        if (found)
            return mp;
    }  /* else member not found or empty list */
    return NULL;
}  /* search_member_list */
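Both search routines above return the element before the match so that the caller can unlink the match from a circular, singly linked list. The following is a minimal, self-contained sketch of that idea. The names `struct node`, `find_prev`, and `unlink_after` are hypothetical and do not appear in the listing, and the sketch tracks only a tail pointer rather than the listing's separate first and last pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node type; the listing's struct group carries more fields. */
struct node {
    char name[16];
    struct node *next;  /* the last node points back to the first */
};

/*
 * Return the node BEFORE the one named "name" in a circular list whose
 * last element is "tail", or NULL if the list is empty or has no match.
 */
struct node *find_prev(struct node *tail, const char *name)
{
    struct node *prev = tail;

    if (tail == NULL)
        return NULL;
    do {
        if (strcmp(prev->next->name, name) == 0)
            return prev;
        prev = prev->next;
    } while (prev != tail);
    return NULL;
}

/* Unlink and return the successor of "prev", updating *tail as needed. */
struct node *unlink_after(struct node **tail, struct node *prev)
{
    struct node *victim;

    assert(tail != NULL && *tail != NULL && prev != NULL);
    victim = prev->next;
    if (victim == prev) {        /* single-element list becomes empty */
        *tail = NULL;
    } else {
        prev->next = victim->next;
        if (*tail == victim)     /* removed the last element */
            *tail = prev;
    }
    victim->next = NULL;
    return victim;
}
```

Returning the predecessor costs nothing extra during the search and avoids a second traversal at removal time, which is presumably why the listing adopts the same convention.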


/**********************************************************************
 * Add new group "groupname" to list of groups.  Return pointer to
 * new group.
 **********************************************************************/
struct group *add_group(groupname)
char *groupname;
{
    struct group *gp;

    /* create new group element */
    if (!(gp = (struct group *) malloc(sizeof(struct group))))
        return NULL;

    /* initialize new group element */
    strcpy(gp->name, groupname);
    gp->members = NULL;
    gp->last = NULL;

    /* connect new group into list */
    if (!group_list)            /* if group_list is empty */
        group_list = gp;
    else                        /* non-empty group_list */
        last_group->next = gp;
    last_group = gp;

    /* point new last element to 1st element; done after linking so */
    /* that a single-element list points back to itself             */
    gp->next = group_list;
    return gp;
}  /* add_group */


/**********************************************************************
 * Add new member to member list of group pointed to by gp.  Return
 * 0 if successful or negative value indicating reason for failure.
 * mbr is a sockaddr_in structure with the new member's address.
 * loop is used to control loopback of message to sender,
 * 0 = no loopback, 1 = loopback.
 **********************************************************************/
int add_member(gp, mbr, loop)
struct group *gp;
struct sockaddr_in mbr;
u_char loop;
{
    struct member *mp;

    /* create new member */
    if (!(mp = (struct member *) malloc(sizeof(struct member))))
        return -3;

    /* add to list */
    if (gp->members == NULL)  /* if member list is empty */
        gp->members = mp;
    else                      /* non-empty member list */
        gp->last->next = mp;  /* add to end of list */
    gp->last = mp;            /* new element is last in list */

    /* initialize new member */
    mp->addr = mbr;
    mp->loop = loop;
    mp->next = gp->members;   /* point new last element to 1st element */
    return 0;
}  /* add_member */


/**********************************************************************
 * Join a new member to a group named "groupname".  The IP address of
 * the new member is in mbr, a sockaddr_in struct.  If the group exists,
 * then the new member is added to the end of the member list.  If the
 * group does not exist, then a new group is first added to the group
 * list, then the new member is added to the group.  Loop is used to
 * control loopback of messages to the sender; 0 = no loopback, 1 =
 * loopback.
 **********************************************************************/
int join_group(groupname, mbr, loop)
char *groupname;
struct sockaddr_in mbr;
u_char loop;
{
    struct group *gp;

    /* check if group exists */
    if (!(gp = search_group_list(groupname))) {  /* group doesn't exist */
        if (!(gp = add_group(groupname)))        /* so add a new group */
            return NEG_JOIN;
    }
    else {  /* group exists */
        gp = gp->next;  /* set gp to desired group */
        if (search_member_list(gp, mbr))  /* member found in list */
            return DUP_MEMBER;
    }

    /* add new member to group */
    if (add_member(gp, mbr, loop) < 0)
        return NEG_JOIN;
    return JOIN_ACK;
}  /* join_group */
/**********************************************************************
 * Remove group pointed to by gp->next from group list.  The group has
 * had all of its members removed and is now ready to be removed from
 * the list.  Return 0 if successful, neg. value if unsuccessful.
 **********************************************************************/
int remove_group(gp)
struct group *gp;  /* gp points to group prior to desired group */
{
    struct group *rgp;  /* group to be removed */

    if (group_list == NULL)  /* empty list */
        return -6;

    if (group_list == last_group) {  /* remove only group */
        free(group_list);
        group_list = last_group = NULL;
    }
    else {  /* remove desired group at gp->next */
        rgp = gp->next;  /* group to be removed */
        gp->next = rgp->next;
        if (group_list == rgp)  /* remove first group */
            group_list = rgp->next;
        if (last_group == rgp)  /* remove last group */
            last_group = gp;
        free(rgp);
    }
    return 0;
}  /* remove_group */
/**********************************************************************
 * Remove a member pointed to by mp->next in group pointed to by gp.
 * mp points to member prior to one to be removed.  Returns 0 on
 * success, neg. value on failure, and 1 if the removal leaves the
 * member list empty.
 **********************************************************************/
int remove_member(gp, mp)
struct group *gp;
struct member *mp;
{
    int cc;
    struct member *rmp;

    if (gp->members == NULL)  /* no members to remove */
        return -7;

    if (gp->members == gp->last) {  /* last member to remove */
        free(gp->members);
        gp->members = gp->last = NULL;
        cc = 1;
    }
    else {  /* remove desired member at mp->next */
        rmp = mp->next;  /* member to be removed */
        mp->next = rmp->next;
        if (gp->members == rmp)  /* remove first member */
            gp->members = rmp->next;
        if (gp->last == rmp)     /* remove last member */
            gp->last = mp;
        free(rmp);
        cc = 0;
    }
    return cc;
}  /* remove_member */
/**********************************************************************
 * Allows a member "mbr" of a group to leave the group "groupname".
 * If the member was the last one, the group is also removed from the
 * group list.  Trying to remove a member that doesn't exist, or a
 * member from a group that doesn't exist, returns error codes.
 * Successful removal of a member returns LEAVE_ACK code.
 **********************************************************************/
int leave_group(groupname, mbr)
char *groupname;
struct sockaddr_in mbr;
{
    struct group *gp, *dgp;
    struct member *mp;
    int empty = 0;

    /* check if group exists */
    if (!(gp = search_group_list(groupname)))  /* group doesn't exist */
        return NO_GROUP;

    /* gp points to group prior to desired group */
    dgp = gp->next;  /* set dgp to desired group */
    if (!(mp = search_member_list(dgp, mbr)))  /* member not found */
        return NO_MEMBER;

    /* mp points to member prior to desired member */
    empty = remove_member(dgp, mp);
    if (empty) remove_group(gp);  /* remove group if empty member list */
    return LEAVE_ACK;
}  /* leave_group */


/**********************************************************************
 * Receives "message" and iteratively sends it to all members of
 * the group "messg.group_name".  Combines "message" with "from" address
 * of sender in an extended format message, stored in messgbuf.  The
 * mcast is sent to the MC_PORT of each member.  Loopback of message
 * to sender is controlled by a comparison of the sender's address
 * (from) with the loop field of each destination member.  On success,
 * returns a count of the number of destinations sent to, on failure
 * returns a neg. value.
 **********************************************************************/
int mcast(s, message, from)
int s;                    /* fd for mcast socket */
char *message;            /* message to send */
struct sockaddr_in from;  /* sender of mcast */
{
    char *messgbuf;
    struct message messg;
    int len, msglen;
    struct group *gp;
    struct member *mp;
    int count = 0;

    /* extract messg from buffer */
    bzero((char *)&messg, sizeof(messg));
    bcopy(message, (char *)&messg, sizeof(messg));

    /* form extended message */
    msglen = sizeof(messg) + (ntohs(messg.excl_list_len) +
             ntohs(messg.subj_list_len)) * 2 + ntohs(messg.data_len);
    len = msglen + sizeof(from);

    /* allocate space for whole extended message */
    if (!(messgbuf = (char *) malloc(len)))
        return -1;

    /* copy message into outgoing buffer */
    bzero(messgbuf, len);
    bcopy((char *)&from, messgbuf, sizeof(from));
    bcopy(message, (messgbuf + sizeof(from)), msglen);

    /* find group */
    if (!(gp = search_group_list(messg.group_name))) {  /* group not found */
        free(messgbuf);  /* release buffer before reporting failure */
        return -1;
    }
    else {  /* group found, gp points to group prior to desired one */
        gp = gp->next;  /* get desired group */
        mp = gp->last;  /* mp = tail of member list */

        /* set from port to MC_PORT for addrcmp search */
        from.sin_port = MC_PORT;
        if (mp != NULL) {  /* non-empty list */
            do {  /* send to all */
                mp = mp->next;
                /* check for loopback to sender, then send to destination */
                if (!((addrcmp(from, mp->addr)) && (mp->loop == NO_LOOP))) {
                    sendto(s, messgbuf, len, 0, (struct sockaddr *)&(mp->addr),
                           sizeof(struct sockaddr));
                    count++;
                }
            } while (mp != gp->last);
        }
        free(messgbuf);
        return count;
    }
}  /* mcast */
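The extended message format described above (the sender's `sockaddr_in` followed by the unchanged message) can be illustrated in isolation. The following is a minimal sketch, assuming an opaque byte payload. The helper names `pack_extended` and `unpack_extended` are hypothetical and are not part of the listing, which performs the packing inline with `bcopy`.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build "sender address + payload" in a freshly allocated buffer.
 * Returns the buffer (caller frees) and stores its size in *out_len,
 * or returns NULL if allocation fails.
 */
char *pack_extended(const struct sockaddr_in *from,
                    const char *payload, size_t payload_len,
                    size_t *out_len)
{
    size_t len;
    char *buf;

    assert(from != NULL && payload != NULL && out_len != NULL);
    len = sizeof(*from) + payload_len;
    buf = malloc(len);
    if (buf == NULL)
        return NULL;
    memcpy(buf, from, sizeof(*from));                   /* sender first */
    memcpy(buf + sizeof(*from), payload, payload_len);  /* then message */
    *out_len = len;
    return buf;
}

/*
 * Receiver side: copy the original sender's address into *from and
 * return a pointer to the payload inside buf, or NULL if buf is too
 * short to hold an address.
 */
const char *unpack_extended(const char *buf, size_t len,
                            struct sockaddr_in *from, size_t *payload_len)
{
    if (len < sizeof(*from))
        return NULL;
    memcpy(from, buf, sizeof(*from));
    *payload_len = len - sizeof(*from);
    return buf + sizeof(*from);
}
```

A receiver would call `unpack_extended` on each datagram arriving on its multicast port to learn who originally sent the message, since the datagram's own source address is that of the emulator.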

/**********************************************************************
 * Print group list.
 **********************************************************************/
void print_group_list()
{
    struct group *gp = last_group;

    printf("Group_List:\n");
    if (gp)  /* non-empty group list */
        do {
            gp = gp->next;
            printf("%s\n", gp->name);
        } while (gp != last_group);
    else printf("Empty group_list\n");
}  /* print_group_list */


/**********************************************************************
 * Print member list of a group pointed to by gp.
 **********************************************************************/
void print_member_list(gp)
struct group *gp;  /* points to the desired group */
{
    struct member *mp = gp->last;

    printf("Member_List for group %s:\n", gp->name);
    if (mp)  /* non-empty member list */
        do {
            mp = mp->next;
            print_sock_addr(mp->addr);
            printf("loop = %d\n\n", mp->loop);
        } while (mp != gp->last);
    else printf("Empty member_list\n");
}  /* print_member_list */
/**********************************************************************
 * MSERVER.C ver 1.0
 *
 * Membership Server program.
 * At present, includes:
 *     join & leave multicast group
 *     message sending & receiving
 *     pairwise monitoring
 *     working on change processing sequence
 * Member failures are logged to file "failures"
 *
 * Written by Dave Neely, April 1994.
 * Modified: 4/25/94
 **********************************************************************/
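The pairwise monitoring listed above relies on deadline timers and a bounded number of retries. The following is a minimal sketch of that mechanism, not the server's own code. The names `deadline_set`, `deadline_passed`, `struct monitor`, and `monitor_on_reply_timeout` are hypothetical, the current time is passed in explicitly so the logic can be checked without a clock, and the retry accounting is a simplified assumption rather than a copy of the server's counter.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

#define USEC_PER_SEC 1000000L

/* Set *deadline to "now" plus usec microseconds. */
void deadline_set(struct timeval *deadline, const struct timeval *now,
                  long usec)
{
    assert(deadline != NULL && now != NULL && usec >= 0);
    deadline->tv_sec = now->tv_sec + usec / USEC_PER_SEC;
    deadline->tv_usec = now->tv_usec + usec % USEC_PER_SEC;
    if (deadline->tv_usec >= USEC_PER_SEC) {  /* carry into seconds */
        deadline->tv_sec += 1;
        deadline->tv_usec -= USEC_PER_SEC;
    }
}

/* Return 1 if "now" is at or past the deadline, 0 otherwise. */
int deadline_passed(const struct timeval *deadline,
                    const struct timeval *now)
{
    if (now->tv_sec != deadline->tv_sec)
        return now->tv_sec > deadline->tv_sec;
    return now->tv_usec >= deadline->tv_usec;
}

/* Monitoring state kept for one neighbor. */
struct monitor {
    int retries_left;  /* queries that may still be re-sent */
    int failed;        /* set once the neighbor is declared failed */
};

/*
 * Called when the reply deadline expires.  Returns 1 if another query
 * should be sent, or 0 if the neighbor is now considered failed.
 */
int monitor_on_reply_timeout(struct monitor *m)
{
    if (m->retries_left > 0) {
        m->retries_left--;
        return 1;
    }
    m->failed = 1;
    return 0;
}
```

In a running server the `now` argument would typically come from `gettimeofday`, and a received reply would reset `retries_left` to its initial value.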

#include "msutil.c"

struct sockaddr_in to, from;    /* general use address structures */
struct hostent *hp;             /* host entity struct */
u_short mygid, cw, ccw;         /* mserver group IDs */
struct timeval tp;              /* for time stamps */
struct timezone tzp;            /* for timing (gettimeofday argument) */
struct timeval recv_timeout;    /* select() receive timeout */
struct timeval query_timeout;   /* timeout for monitoring query */
struct timeval reply_timeout;   /* timeout for monitoring reply */
struct timeval messg_timeout;   /* timeout for response message */
struct timeval ACK_timeout;     /* timeout for ACK message */
FILE *fp;                       /* file to record mserver failures */
int MCASTER;

main(argc, argv)
int argc;
char *argv[];
{
    u_short message_type;
    u_short GV, gsize;  /* group view no. and group size */
    int len, i, cc;
    int recd, sent;
    int retries = 2;    /* monitoring retries for no reply */
    char groupname[MAXGROUPNAME];
    char hostname[MAXGROUPNAME];
    char IPaddr[16];
    char groupaddr[16];
    struct table_entry core_table[MAXTBLSIZE];  /* core-set state table */
    u_short coordinator;  /* for change processing */
    long authentication = 0x7fffffff;
    struct gid_entry *excl_list, *subj_list;  /* lists of mserver gids */
    u_short excl_list_len, subj_list_len;

    if (argc != 8) {
        printf("usage:  mserver groupname groupIPaddr");
        printf(" mygid cw_gid cw_addr ccw_gid ccw_addr\n");
        exit(-1);
    }

    /**** Note: no error checking on arguments ****/
    strcpy(groupname, argv[1]);
    strcpy(groupaddr, argv[2]);
    mygid = atoi(argv[3]);
    cw = atoi(argv[4]);
    ccw = atoi(argv[6]);

    /* get info about local host */
    gethostname(hostname, MAXGROUPNAME);
    if ((hp = gethostbyname(hostname)) == 0) {
        perror("unable to get hostname");
        exit(-1);
    }
    print_hostent(hp);
    strcpy(IPaddr, inet_ntoa(*(struct in_addr *)(hp->h_addr)));

    /* initialize core_table */
    bzero((char *)core_table, sizeof(core_table));
    core_table[mygid].addr = inet_addr(IPaddr);
    core_table[mygid].cw = cw;
    core_table[mygid].ccw = ccw;
    core_table[cw].addr = inet_addr(argv[5]);
    core_table[ccw].addr = inet_addr(argv[7]);
    core_table[cw].ccw = mygid;
    core_table[ccw].cw = mygid;

    /* initialize gid lists */
    excl_list_len = subj_list_len = 0;
    excl_list = subj_list = NULL;

    /* determine if IP multicast or MCASTER will be used */
#ifndef IFF_MULTICAST
    MCASTER = 1;
#else  /* check that group address is in Class D range */
    /* compare in host byte order so the range test is portable */
    if ((ntohl(inet_addr(groupaddr)) < ntohl(inet_addr("224.0.0.255"))) ||
        (ntohl(inet_addr(groupaddr)) > ntohl(inet_addr("239.255.255.255"))))
        MCASTER = 1;
#endif

    printf("Mserver\n\n");
    printf("mygid:  %d, cw:  %d, ccw:  %d\n", mygid,
           core_table[mygid].cw, core_table[mygid].ccw);
    printf("my  ");
    print_in_addr(&(core_table[mygid].addr));
    printf("cw  ");
    print_in_addr(&(core_table[cw].addr));
    printf("ccw  ");
    print_in_addr(&(core_table[ccw].addr));
    /* initialize general purpose "ms" & mcaster "mc" sockets */
    ms = init_socket(sin, htons(MS_PORT));
    print_sock_info(ms, sin);
    mc = init_socket(mcsin, htons(MC_PORT));
    print_sock_info(mc, mcsin);

    /* initialize timeouts */
    /* recv_timeout is an absolute period, not referenced to current time */
    recv_timeout.tv_sec = T_RECV / SEC;     /* set seconds */
    recv_timeout.tv_usec = T_RECV % SEC;    /* set useconds */
    set_timeout(&query_timeout, T_QUERY);   /* set timer for next query */
    set_timeout(&reply_timeout, MAXTIME);   /* reset timer for reply */
    set_timeout(&messg_timeout, MAXTIME);   /* reset messg timer */
    set_timeout(&ACK_timeout, MAXTIME);     /* reset ACK timer */

    cc = join_mcast_group(groupname, groupaddr);
    switch (cc) {
    case JOIN_ACK:
        printf("Group %s joined\n", groupname);
        break;
    case DUP_MEMBER:
        printf("Unable to join group %s: duplicate member\n", groupname);
        exit(-1);
        break;
    case NEG_JOIN:
        printf("Unable to join group %s.\n", groupname);
        exit(-1);
        break;
    default:
        printf("Invalid code returned during group join\n");
        exit(-1);
    }




    for (;;) {  /* begin main loop */
        len = sizeof(from);

        /* check if message ready */
        if ((recd = recv_messg(ms, mc, &messg, &from, recv_timeout)) > 0) {
            printf("MESSAGE RECEIVED:\n");
            printf("%d bytes received:\n", recd);
            print_messg(messg);
            printf("from:\n");
            print_sock_addr(from);
            message_type = ntohs(messg.msg_type);

            /* select appropriate action for received message type */
            switch (message_type) {

            /* mserver set message types */
            case QUERY:
                /* check if query from cw neighbor in this group */
                if ((!(strcmp(messg.group_name, groupname))) &&
                    (from.sin_addr.s_addr == core_table[cw].addr)) {
                    /* then send reply */
                    form_messg(&messg, groupname, 0, 0, mygid, REPLY,
                               cw, 0, 0, 0, 0, 0, 0);
                    len = sizeof(messg);
                    if ((sent = send_messg(ms, messg, from)) != len) {
                        printf("error in message length sent\n");
                    }
                }
                break;

            case REPLY:
                /* check if reply from ccw neighbor in this group */
                if ((!(strcmp(messg.group_name, groupname))) &&
                    (from.sin_addr.s_addr == core_table[ccw].addr)) {
                    /* then reset reply and query timers, and # retries */
                    printf("REPLY rec'd from %d, resetting timers\n",
                           ntohs(messg.sender_gid));
                    set_timeout(&reply_timeout, MAXTIME);
                    set_timeout(&query_timeout, T_QUERY);
                    retries = 2;
                }
                break;

            /* mserver INITIATE message types */
            case M_JOIN:         printf("  M_JOIN\n");         break;
            case M_LEAVE:        printf("  M_LEAVE\n");        break;
            case M_SPLIT:        printf("  M_SPLIT\n");        break;
            case M_MERGE:        printf("  M_MERGE\n");        break;
            case M_ADD_PARENT:   printf("  M_ADD_PARENT\n");   break;
            case M_DEL_PARENT:   printf("  M_DEL_PARENT\n");   break;
            case M_STATS_S:      printf("  M_STATS_S\n");      break;
            case M_STATS_L:      printf("  M_STATS_L\n");      break;
            case M_FAIL:         printf("  M_FAIL\n");         break;
            case M_MULTI_FAIL:   printf("  M_MULTI_FAIL\n");   break;
            case M_COORD_FAIL:   printf("  M_COORD_FAIL\n");   break;

            /* change sequence message types */
            case ACK:            printf("  ACK\n");            break;
            case COMMIT:         printf("  COMMIT\n");         break;
            case WAIT:           printf("  WAIT\n");           break;
            case MSG_QUERY:      printf("  MSG_QUERY\n");      break;
            case INIT:           printf("  INIT\n");           break;

            /* external physical requests to core-set */
            case M_JOIN_REQ:     printf("  M_JOIN_REQ\n");     break;
            case M_LEAVE_REQ:    printf("  M_LEAVE_REQ\n");    break;
            case M_SPLIT_REQ:    printf("  M_SPLIT_REQ\n");    break;
            case M_MERGE_REQ:    printf("  M_MERGE_REQ\n");    break;
            case M_ADD_PAR_REQ:  printf("  M_ADD_PAR_REQ\n");  break;
            case M_DEL_PAR_REQ:  printf("  M_DEL_PAR_REQ\n");  break;
            case M_STATS_S_REQ:  printf("  M_STATS_S_REQ\n");  break;
            case M_STATS_L_REQ:  printf("  M_STATS_L_REQ\n");  break;

            /* application group INITIATE message types */
            case A_JOIN:         printf("  A_JOIN\n");         break;
            case A_LEAVE:        printf("  A_LEAVE\n");        break;
            case A_SPLIT:        printf("  A_SPLIT\n");        break;
            case A_MERGE:        printf("  A_MERGE\n");        break;
            case A_STATS_S:      printf("  A_STATS_S\n");      break;
            case A_STATS_L:      printf("  A_STATS_L\n");      break;
            case SUBMIT:         printf("  SUBMIT\n");         break;
            case DIRECT:         printf("  DIRECT\n");         break;

            /* application group request message types */
            case A_JOIN_REQ:     printf("  A_JOIN_REQ\n");     break;
            case A_LEAVE_REQ:    printf("  A_LEAVE_REQ\n");    break;
            case A_SPLIT_REQ:    printf("  A_SPLIT_REQ\n");    break;
            case A_MERGE_REQ:    printf("  A_MERGE_REQ\n");    break;
            case A_STATS_S_REQ:  printf("  A_STATS_S_REQ\n");  break;
            case A_STATS_L_REQ:  printf("  A_STATS_L_REQ\n");  break;

            /* mcaster message types */
            case JOIN_GROUP:     printf("  JOIN_GROUP\n");     break;
            case LEAVE_GROUP:    printf("  LEAVE_GROUP\n");    break;
            case JOIN_ACK:       printf("  JOIN_ACK\n");       break;
            case DUP_MEMBER:     printf("  DUP_MEMBER\n");     break;
            case NEG_JOIN:       printf("  NEG_JOIN\n");       break;
            case LEAVE_ACK:      printf("  LEAVE_ACK\n");      break;
            case NO_GROUP:       printf("  NO_GROUP\n");       break;
            case NO_MEMBER:      printf("  NO_MEMBER\n");      break;
            case NEG_LEAVE:      printf("  NEG_LEAVE\n");      break;
            default:
                printf("  %d\n", ntohs(messg.msg_type));
            }  /* switch */

        }  /* if (recd > 0) */
        if (recd < 0)
            printf("error in message rec'd\n");

        /* check timeouts */
        if (timed_out(query_timeout)) {  /* time to send a new query */
            /* reset QUERY timer */
            set_timeout(&query_timeout, T_QUERY);
            /* set REPLY timer */
            set_timeout(&reply_timeout, T_REPLY);
            form_messg(&messg, groupname, 0, 0, mygid, QUERY,
                       ccw, 0, 0, 0, 0, 0, 0);
            len = sizeof(messg);
            to.sin_family = AF_INET;
            to.sin_port = htons(MS_PORT);
            to.sin_addr.s_addr = core_table[ccw].addr;
            if ((sent = send_messg(ms, messg, to)) != len) {
                printf("error in message length sent\n");
                exit(-1);
            }
        }  /* query_timeout */

        if (timed_out(reply_timeout)) {  /* retry or note failure */
            /* reset QUERY timer */
            set_timeout(&query_timeout, T_QUERY);
            if ((retries--) < 0) {  /* then ccw is failed */
                retries = 2;  /* reset retry counter */

                /* log an entry in failures file */
                gettimeofday(&tp, &tzp);
                if ((fp = fopen("failures", "a")) != NULL) {
                    fprintf(fp,
                            "Member %d is detected failed by %d at %ld sec.\n\n",
                            ccw, mygid, (long) tp.tv_sec);
                    fclose(fp);
                }

                /* At this point, would want to start fail processing */
                set_timeout(&reply_timeout, MAXTIME);  /* reset reply timer */
            }
            else {
                /* set REPLY timer */
                set_timeout(&reply_timeout, T_REPLY);
                form_messg(&messg, groupname, 0, 0, mygid, QUERY,
                           ccw, 0, 0, 0, 0, 0, 0);